Methods for spectral analysis and their applications: spectral replacement

ABSTRACT

This invention pertains to methods for the analysis of biological data, particularly spectra, for example, nuclear magnetic resonance (NMR) and other types of spectra. More specifically, the present invention pertains to a method for processing a sample spectrum comprising: replacing each of one or more target regions in said sample spectrum with a corresponding replacement region of a master control spectrum to give a target-replaced sample spectrum, wherein said replacement region has been scaled so as to have the same fraction of the total integrated intensity in said target-replaced sample spectrum as it did in said master control spectrum. The present invention also pertains to analysis methods which employ the methods of the present invention, such as methods of identifying a biomarker or biomarker combination for an applied stimulus; classification of an applied stimulus; diagnosis of an applied stimulus; therapeutic monitoring of a subject undergoing therapy; evaluating drug therapy and/or drug efficacy; detecting toxic side-effects of drugs; and characterizing and/or identifying a drug in overdose.

TECHNICAL FIELD

[0001] This invention pertains generally to the fields of chemometrics and metabonomics and, more particularly, to methods for the analysis of chemical, biochemical, and biological data, for example, spectra, such as nuclear magnetic resonance (NMR) and other types of spectra.

BACKGROUND

[0002] Significant progress has been made in developing methods to determine and quantify the biochemical processes occurring in living systems. Such methods are valuable in the diagnosis, prognosis and treatment of disease, the development of drugs, as well as for improving therapeutic regimes for current drugs.

[0003] Diseases of the human or animal body (such as cancers, degenerative diseases, autoimmune diseases and the like) have an underlying basis in alterations in the expression of certain genes. The expressed gene products, proteins, mediate effects such as abnormal cell growth, cell death or inflammation. Some of these effects are caused directly by protein-protein interactions; others are caused by proteins acting on small molecules (e.g. “second messengers”) which trigger effects including further gene expression.

[0004] Likewise, disease states caused by external agents such as viruses and bacteria provoke a multitude of complex responses in the infected host.

[0005] In a similar manner, the treatment of disease through the administration of drugs can result in a wide range of desired effects and unwanted side effects in a patient.

[0006] At the genetic level, methods for examining gene expression in response to these types of events are often referred to as “genomic methods,” and are concerned with the detection and quantification of the expression of an organism's genes, collectively referred to as its “genome,” usually by detecting and/or quantifying genetic molecules, such as DNA and RNA. Genomic studies often exploit a new generation of proprietary “gene chips,” which are small disposable devices encoded with an array of genes that respond to extracted mRNAs produced by cells (see, for example, Klenk et al., 1997). Many genes can be placed on a chip array and patterns of gene expression, or changes therein, can be monitored rapidly, although at some considerable cost.

[0007] However, the biological consequences of gene expression, or altered gene expression following perturbation, are extremely complex. This has led to the development of “proteomic methods” which are concerned with the semi-quantitative measurement of the production of cellular proteins of an organism, collectively referred to as its “proteome” (see, for example, Geisow, 1998). Proteomic measurements utilise a variety of technologies, but all involve a protein separation method, e.g., 2D gel-electrophoresis, allied to a chemical characterisation method, usually some form of mass spectrometry.

[0008] In recent years, it has been appreciated that the reaction of human and animal subjects to diseases and treatments for them can vary according to the genomic makeup of an individual. This has led to the development of the field of “pharmacogenomics.” A fuller understanding of how an individual's own genome reacts to a particular disease will allow the development of new therapies, as well as the refinement of existing ones.

[0009] At present, genomic and proteomic methods, which are both expensive and labour intensive, have the potential to be powerful tools for studying biological response. The choice of method is still uncertain since careful studies have sometimes shown a low correlation between the pattern of gene expression and the pattern of protein expression, probably due to sampling for the two technologies at inappropriate time points (see, e.g., Gygi et al., 1999). Even in combination, genomic and proteomic methods still do not provide the range of information needed for understanding integrated cellular function in a living system, since they do not take account of the dynamic metabolic status of the whole organism.

[0010] For example, genomic and proteomic studies may implicate a particular gene or protein in a disease or a xenobiotic response because the level of expression is altered, but the change in gene or protein level may be transitory or may be counteracted downstream and as a result there may be no effect at the cellular and/or biochemical level. Conversely, sampling tissue for genomic and proteomic studies at inappropriate time points may result in a relevant gene or protein being overlooked.

[0011] Nonetheless, recent advances in genomics and proteomics now permit the rapid identification of new potential targets for drug development. With a new target in hand, and with the aid of combinatorial chemistry and high throughput screening, the pharmaceutical industry is capable of rapidly generating and screening thousands of new candidate compounds each week.

[0012] However, in practice, only a few of these candidate compounds will be taken further, for example, into pre-clinical and clinical development. It is therefore critical to identify those candidate compounds with the most promise, and this is usually judged by efficacy and toxicology, before selection for clinical studies. However, these selection processes are imperfect and many drugs fail in clinical trials due to lack of efficacy and/or toxicological effects. It is also possible that other drugs may fail overall because they are only effective in a subgroup of patients who have an unrecognised pharmacogenomic response. There is a great need to find new ways of reducing this compound “attrition” or losses of drugs late in the development process, for example, through the development and application of analytical technologies designed to maximise efficiency of compound selection and to minimise attrition rates.

[0013] While genomic and proteomic methods may be useful aids in compound selection, they do suffer from substantial limitations. For example, while genomic and proteomic methods may ultimately give profound insights into toxicological mechanisms and provide new surrogate biomarkers of disease, at present it is very difficult to relate genomic and proteomic findings to classical cellular or biochemical indices or endpoints. One simple reason for this is that with current technology and approach, the correlation of the time-response to drug exposure is difficult. Further difficulties arise with in vitro cell-based studies. These difficulties are particularly important for the many known cases where the metabolism of the compound is a prerequisite for a toxic effect and especially true where the target organ is not the site of primary metabolism. This is particularly true for pro-drugs, where some aspect of in situ chemical (e.g., enzymatic) modification is required for activity.

[0014] A new “metabonomic” approach has been proposed which is aimed at augmenting and complementing the information provided by genomics and proteomics. “Metabonomics” is conventionally defined as “the quantitative measurement of the multiparametric metabolic response of living systems to pathophysiological stimuli or genetic modification” (see, for example, Nicholson et al., 1999). This concept has arisen primarily from the application of ¹H NMR spectroscopy to study the metabolic composition of biofluids, cells, and tissues and from studies utilising pattern recognition (PR), expert systems and other chemoinformatic tools to interpret and classify complex NMR-generated metabolic data sets. Metabonomic methods have the potential, ultimately, to determine the entire dynamic metabolic make-up of an organism.

[0015] A pathological condition or a xenobiotic may act at the pharmacological level only and hence may not affect gene regulation or expression directly. Alternatively, significant disease or toxicological effects may be completely unrelated to gene switching. For example, exposure to ethanol in vivo may switch on many genes but none of these gene expression events explains drunkenness. In cases such as these, genomic and proteomic methods are likely to be ineffective. However, all disease or drug-induced pathophysiological perturbations result in disturbances in the ratios and concentrations, binding or fluxes of endogenous biochemicals, either by direct chemical reaction or by binding to key enzymes or nucleic acids that control metabolism. If these disturbances are of sufficient magnitude, effects will result which will affect the efficient functioning of the whole organism. In body fluids, metabolites are in dynamic equilibrium with those inside cells and tissues and, consequently, abnormal cellular processes in tissues of the whole organism following a toxic insult or as a consequence of disease will be reflected in altered biofluid compositions.

[0016] Fluids secreted, excreted, or otherwise derived from an organism (“biofluids”) provide a unique window into its biochemical status since the composition of a given biofluid is a consequence of the function of the cells that are intimately concerned with the fluid's manufacture and secretion. For example, the composition of a particular fluid can carry biochemical information on details of organ function (or dysfunction), for example, as a result of xenobiotics, disease, and/or genetic modification. Similarly, the composition and condition of an organism's tissues are also indicators of the organism's biochemical status. Examples of biofluids include urine, blood plasma, milk, etc.

[0017] Biofluids often exhibit very subtle changes in metabolite profile in response to external stimuli. This is because the body's cellular systems attempt to maintain homeostasis (constancy of internal environment), for example, in the face of cytotoxic challenge. One means of achieving this is to modulate the composition of biofluids. Hence, even when cellular homeostasis is maintained, subtle responses to disease or toxicity are expressed in altered biofluid composition. However, dietary, diurnal and hormonal variations may also influence biofluid compositions, and it is clearly important to differentiate these effects if correct biochemical inferences are to be drawn from their analysis.

[0018] One of the most successful approaches to biofluid analysis has been the use of NMR spectroscopy (see, for example, Nicholson et al., 1989); similarly, intact tissues have been successfully analysed using magic-angle-spinning ¹H NMR spectroscopy (see, for example, Moka et al., 1998; Tomlins et al., 1998).

[0019] The NMR spectrum of a biofluid provides a metabolic fingerprint or profile of the organism from which the biofluid was obtained, and this metabolic fingerprint or profile is characteristically changed by a disease, toxic process, or genetic modification. For example, NMR spectra may be collected for various states of an organism, e.g., pre-dose and various times post-dose, for one or more xenobiotics, separately or in combination; healthy (control) and diseased animal; unmodified (control) and genetically modified animal.

[0020] For example, in the evaluation of undesired toxic side-effects of drugs, each compound or class of compound produces characteristic changes in the concentrations and patterns of endogenous metabolites in biofluids that provide information on the sites and basic mechanisms of the toxic process. ¹H NMR analysis of biofluids has successfully uncovered novel metabolic markers of organ-specific toxicity in the laboratory rat, and it is in this “exploratory” role that NMR as an analytical biochemistry technique excels. However, the biomarker information in NMR spectra of biofluids is very subtle, as hundreds of compounds representing many pathways can often be measured simultaneously, and it is this overall metabonomic response to toxic insult that so well characterises the lesion.

[0021] All biological fluids and tissues have their own characteristic physico-chemical properties, and these affect the types of NMR experiment that may be usefully employed. One major advantage of using NMR spectroscopy to study complex biomixtures is that measurements can often be made with minimal sample preparation (usually with only the addition of 5-10% D₂O) and a detailed analytical profile can be obtained on the whole biological sample. Sample volumes are small, typically 0.3 to 0.5 mL for standard probes, and as low as 3 μL for microprobes. Acquisition of simple NMR spectra is rapid and efficient using flow-injection technology. It is usually necessary to suppress the water NMR resonance.

[0022] Many biofluids are not chemically stable and for this reason care should be taken in their collection and storage. For example, cell lysis in erythrocytes can easily occur. If a substantial amount of D₂O has been added, then it is possible that certain ¹H NMR resonances will be lost by H/D exchange. Freeze-drying of biofluid samples also causes the loss of volatile components such as acetone. Biofluids are also very prone to microbiological contamination, especially fluids, such as urine, which are difficult to collect under sterile conditions. Many biofluids contain significant amounts of active enzymes, either normally or due to a disease state or organ damage, and these enzymes may alter the composition of the biofluid following sampling. Samples should be stored deep frozen to minimise the effects of such contamination. Sodium azide is usually added to urine at the collection point to act as an antimicrobial agent. Metal ions and/or chelating agents (e.g., EDTA) may be added to bind to endogenous metal ions (e.g., Ca²⁺, Mg²⁺ and Zn²⁺) and chelating agents (e.g., free amino acids, especially glutamate, cysteine, histidine and aspartate; citrate) to alter and/or enhance the NMR spectrum.

[0023] In all cases the analytical problem usually involves the detection of “trace” amounts of analytes in a very complex matrix of potential interferences. It is, therefore, critical to choose a suitable analytical technique for the particular class of analyte of interest in the particular biomatrix, which could be a biofluid or a tissue. High resolution NMR spectroscopy (in particular ¹H NMR) appears to be particularly appropriate. The main advantages of using ¹H NMR spectroscopy in this area are the speed of the method (with spectra being obtained in 5 to 10 minutes), the requirement for minimal sample preparation, and the fact that it provides a non-selective detector for all the abnormal metabolites in the biofluid regardless of their structural type, providing only that they are present above the detection limit of the NMR experiment and that they contain non-exchangeable hydrogen atoms. The speed advantage is of crucial importance in this area of work, as the clinical condition of a patient may require rapid diagnosis and can change very rapidly, so that correspondingly rapid changes must be made to the therapy provided.

[0024] NMR studies of body fluids should ideally be performed at the highest magnetic field available to obtain maximal dispersion and sensitivity and most ¹H NMR studies have been performed at 400 MHz or greater. With every new increase in available spectrometer frequency the number of resonances that can be resolved in a biofluid increases and although this has the effect of solving some assignment problems, it also poses new ones. Furthermore, there are still important problems of spectral interpretation that arise due to compartmentation and binding of small molecules in the organised macromolecular domains that exist in some biofluids such as blood plasma and bile. All this complexity need not reduce the diagnostic capabilities and potential of the technique, but demonstrates the problems of biological variation and the influence of variation on diagnostic certainty.

[0025] The information content of biofluid spectra is very high and the complete assignment of the ¹H NMR spectrum of most biofluids is usually not possible (even using 900 MHz NMR spectroscopy, the highest frequency commercially available). However, the assignment problems vary considerably between biofluid types. Some fluids have near constant composition and concentrations and in these the majority of the NMR signals have been assigned. In contrast, urine composition can be very variable and there is enormous variation in the concentration range of NMR-detectable metabolites; consequently, complete analysis is much more difficult. Those metabolites present close to the limits of detection for 1-dimensional (1D) NMR spectroscopy (ca. 100 nM for many metabolites at 800 MHz) pose severe NMR spectral assignment problems. (In absolute terms, the detection limit may be ca. 4 nmol, e.g., 1 μg of a 250 g/mol compound in a 0.5 mL sample volume.) Even at the present level of technology in NMR, it is not yet possible to detect many important biochemical substances, e.g. hormones, proteins or nucleic acids, in body fluids because of problems with sensitivity, line widths, dispersion and dynamic range, and this area of research will continue to be technology-limited. In addition, the collection of NMR spectra of biofluids may be complicated by the relative water intensity, sample viscosity, protein content, lipid content, and low molecular weight peak overlap.

[0026] Usually, in order to assign ¹H NMR spectra, comparison is made with spectra of authentic materials and/or by standard addition of an authentic reference standard to the sample. Additional confirmation of assignments is usually sought from the application of other NMR methods, including, for example, 2-dimensional (2D) NMR methods, particularly COSY (correlation spectroscopy), TOCSY (total correlation spectroscopy), inverse-detected heteronuclear correlation methods such as HMBC (heteronuclear multiple bond correlation), HSQC (heteronuclear single quantum coherence), and HMQC (heteronuclear multiple quantum coherence), 2D J-resolved (JRES) methods, spin-echo methods, relaxation editing, diffusion editing (including both 1D NMR and 2D NMR such as diffusion-edited TOCSY), and multiple quantum filtering. Detailed ¹H NMR spectroscopic data for a wide range of metabolites and biomolecules found in biofluids have been published (see, for example, Lindon et al., 1999) and supplementary information is available in several literature compilations of data (see, for example, Fan, 1996; Sze et al., 1994).

[0027] For example, the successful application of ¹H NMR spectroscopy of biofluids to study a variety of metabolic diseases and toxic processes has now been well established and many novel metabolic markers of organ-specific toxicity have been discovered (see, for example, Nicholson et al., 1989; Lindon et al., 1999). For example, NMR spectra of urine are identifiably altered in situations where damage has occurred to the kidney or liver. It has been shown that specific and identifiable changes can be observed which distinguish the organ that is the site of a toxic lesion. Also it is possible to focus in on particular parts of an organ such as the cortex of the kidney and even in favourable cases to very localised parts of the cortex. Finally it is possible to deduce the biochemical mechanism of the xenobiotic toxicity, based on a biochemical interpretation of the changes in the urine. A wide range of toxins has now been investigated including mostly kidney toxins and liver toxins, but also testicular toxins, mitochondrial toxins and muscle toxins.

[0028] However, a limiting factor in understanding the biochemical information from both 1D and 2D NMR spectra of tissues and biofluids is their complexity. The most efficient way to investigate these complex multiparametric data is to employ the 1D and 2D NMR metabonomic approach in combination with computer-based “pattern recognition” (PR) methods and expert systems. These statistical tools are similar to those currently being explored by workers in the fields of genomics and proteomics.

[0029] Pattern recognition (PR) is a general term applied to methods of data analysis which can be used to generate scientific hypotheses as well as to test hypotheses by mathematically reducing the many parameters.

[0030] PR methods may be conveniently classified as “supervised” or “unsupervised.” Unsupervised methods are used to analyse data without reference to any other independent knowledge, for example, without regard to the identity or nature of a xenobiotic or its mode of action.

[0031] Examples of unsupervised pattern recognition methods include principal component analysis (PCA), hierarchical cluster analysis (HCA), and non-linear mapping (NLM).

[0032] One of the most useful and easily applied unsupervised PR techniques is principal components analysis (PCA) (see, for example, Sharaf, 1986). Principal components (PCs) are new variables created from linear combinations of the starting variables with appropriate weighting coefficients. The properties of these PCs are such that: (i) each PC is orthogonal to (uncorrelated with) all other PCs, and (ii) the first PC contains the largest part of the variance of the data set (information content) with subsequent PCs containing correspondingly smaller amounts of variance.

[0033] A data matrix, X, made up of rows where each row defines a sample, and columns, where each column defines a particular spectral descriptor, can be regarded as composed of a scores matrix, T, and a loadings matrix, L, such that X = TL^(t), where t denotes the transpose. The covariance matrix, C, is calculated from the data matrix, X. The eigenvalues and eigenvectors of the covariance matrix are determined by diagonalisation. The coordinates in eigenvector plots (the principal components, PCs) are denoted “scores” and comprise the scores matrix T. The eigenvector coefficients are denoted “loadings” and comprise the loadings matrix L, and give the contributions of the descriptors to the PCs.
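By way of illustration only, the decomposition described above may be sketched as follows. The sketch assumes that NumPy is available and that the data matrix is mean-centred before the covariance matrix is calculated, so that it is the centred matrix that equals TL^(t); the function name pca and the example values are hypothetical and are not taken from this specification.

```python
import numpy as np


def pca(X):
    """Return scores T, loadings L and eigenvalues for a samples-by-descriptors matrix."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # assumption: mean-centre each descriptor
    C = np.cov(Xc, rowvar=False)            # covariance matrix of the descriptors
    eigvals, eigvecs = np.linalg.eigh(C)    # diagonalisation of the symmetric matrix C
    order = np.argsort(eigvals)[::-1]       # largest variance first
    eigvals = eigvals[order]
    L = eigvecs[:, order]                   # loadings: eigenvector coefficients
    T = Xc @ L                              # scores: coordinates along each PC
    return T, L, eigvals


# Hypothetical data: four samples described by three spectral descriptors.
X = np.array([[1.0, 2.0, 0.5],
              [2.0, 4.1, 0.4],
              [3.0, 5.9, 0.6],
              [4.0, 8.2, 0.5]])
T, L, eigvals = pca(X)
print(T[:, :2])   # the first two columns are the PC1 and PC2 scores
```

Because the loadings are orthonormal, multiplying the scores by the transposed loadings recovers the centred data matrix, and the covariance matrix of the scores is diagonal.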

[0034] Thus a plot of the first two or three PC scores gives the “best” representation, in terms of information content, of the data set in two or three dimensions, respectively. A plot of the first two principal component scores, PC1 and PC2, is often called a “scores plot”, and provides the maximum information content of the data in two dimensions. Such PC maps can be used to visualise inherent clustering behaviour for drugs and toxins acting on each organ according to toxic mechanism. Of course, the clustering information might be in lower PCs and these also have to be examined.

[0035] In this simple metabonomic approach, a sample from an animal treated with a compound of unknown toxicity is compared with a database of NMR-generated metabolic data from control and toxin-treated animals. By observing its position on the PR map relative to samples of known effect, the unknown toxin can often be classified. However, toxicological data are often more complex, with time-related development of lesions and associated shifts in NMR-detected biochemistry. Also, it is more rigorous to compare effects of xenobiotics in the original n-dimensional NMR metabonomic space.

[0036] Hierarchical Cluster Analysis, another unsupervised pattern recognition method, permits the grouping of data points which are similar by virtue of being “near” to one another in some multi-dimensional space whose coordinates are defined by the NMR descriptors which may be, for example, the signal intensities for particular assigned peaks in an NMR spectrum. A “similarity matrix,” S, is constructed with elements s_(ij) = 1 − r_(ij)/r_(ij)^(max), where r_(ij) is the interpoint distance between points i and j (e.g., Euclidean interpoint distance), and r_(ij)^(max) is the largest interpoint distance for all points. The most distant pair of points will have s_(ij) equal to 0, since r_(ij) then equals r_(ij)^(max). Conversely, the closest pair of points will have the largest s_(ij), approaching 1.
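By way of illustration only, the similarity matrix defined above may be computed as in the following sketch, which assumes Euclidean interpoint distances; the function name similarity_matrix and the example points are hypothetical.

```python
import math


def similarity_matrix(points):
    """Build S with s_ij = 1 - r_ij / r_max from Euclidean interpoint distances."""
    dist = [[math.dist(p, q) for q in points] for p in points]
    r_max = max(max(row) for row in dist)   # largest interpoint distance
    if r_max == 0:
        raise ValueError("all points coincide; similarity is undefined")
    return [[1.0 - r / r_max for r in row] for row in dist]


# Hypothetical descriptors for three samples (e.g., two peak intensities each).
points = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]
S = similarity_matrix(points)
for row in S:
    print([round(s, 3) for s in row])
```

In this example the first and second points are the most distant pair, so their similarity is 0, while the first and third points are the closest pair and have the largest off-diagonal similarity.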

[0037] The similarity matrix is scanned for the closest pair of points. The pair of points are reported with their separation distance, and then the two points are deleted and replaced with a single combined point. The process is then repeated iteratively until only one point remains. A number of different methods may be used to determine how two clusters will be joined, including the nearest neighbour method (also known as the single link method), the furthest neighbour method, and the centroid method (including centroid link, incremental link, median link, group average link, and flexible link variations).
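By way of illustration only, the iterative joining procedure may be sketched as follows. The sketch assumes the nearest neighbour (single link) rule and Euclidean distances, and recomputes distances directly rather than scanning a stored similarity matrix; the function name single_link_merges and the example points are hypothetical.

```python
import math


def single_link_merges(points):
    """Repeatedly join the two closest clusters; report each join and its distance."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: cluster distance is the smallest member-to-member distance.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]   # replace the pair with one combined cluster
        del clusters[b]
    return merges


# Hypothetical data: two tight pairs of samples that are far from each other.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5)]
for left, right, distance in single_link_merges(points):
    print(left, right, round(distance, 3))
```

The reported joins, in order of increasing separation distance, are the information from which a dendrogram would be drawn.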

[0038] The reported connectivities are then plotted as a dendrogram (a tree-like chart which allows visualisation of clustering), showing sample-sample connectivities versus increasing separation distance (or equivalently, versus decreasing similarity). The dendrogram has the property in which the branch lengths are proportional to the distances between the various clusters and hence the length of the branches linking one sample to the next is a measure of their similarity. In this way, similar data points may be identified algorithmically.

[0039] Non-linear mapping (NLM) involves calculation of the distances between all of the points in the original multi-dimensional space. This is followed by construction of a map of points in 2 or 3 dimensions where the sample points are placed in random positions or at values determined by a prior principal components analysis. The least squares criterion is used to move the sample points in the lower dimension map to fit the inter-point distances in the lower dimension space to those in the higher dimensional space. Non-linear mapping is therefore an approximation to the true inter-point distances, but points close in the original multi-dimensional space should also be close in 2 or 3 dimensional space (see, for example, Brown et al., 1996; Farrant et al., 1992).
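By way of illustration only, a least squares fit of this kind may be sketched as follows. The sketch assumes NumPy, random starting positions, an unweighted sum of squared distance differences as the criterion, and plain gradient descent with a fixed step; these choices, and the names pairwise, stress and nonlinear_map, are assumptions made for the example and are not taken from this specification or from the cited references.

```python
import numpy as np


def pairwise(P):
    """Euclidean distances between all rows of P."""
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def stress(D_high, Y):
    """Sum over pairs of squared differences between map and original distances."""
    return ((pairwise(Y) - D_high) ** 2).sum() / 2.0


def nonlinear_map(X, n_dims=2, n_iter=500, step=0.01, seed=0):
    """Move randomly placed map points so that their distances fit those of X."""
    X = np.asarray(X, dtype=float)
    D_high = pairwise(X)
    Y = np.random.default_rng(seed).normal(size=(X.shape[0], n_dims))
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]
        d_low = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(d_low, 1.0)            # avoid 0/0 on the diagonal
        coeff = (d_low - D_high) / d_low
        np.fill_diagonal(coeff, 0.0)            # a point exerts no force on itself
        grad = 2.0 * (coeff[:, :, None] * diff).sum(axis=1)
        Y -= step * grad                        # gradient descent on the stress
    return Y


# Hypothetical data: five samples in a three-dimensional descriptor space.
X = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]], dtype=float)
Y = nonlinear_map(X)
print(stress(pairwise(X), Y))
```

The residual stress is generally non-zero, which reflects the statement above that the map is only an approximation to the true inter-point distances.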

[0040] Alternatively, and in order to develop automatic classification methods, it has proved efficient to use a “supervised” approach to NMR data analysis. Here, a “training set” of NMR metabonomic data is used to construct a statistical model that predicts correctly the “class” of each sample. The model is then tested with independent data (“test set”) to determine the robustness of the computer-based model. These models are sometimes termed “Expert Systems,” but may be based on a range of different mathematical procedures. Supervised methods can use a data set with reduced dimensionality (for example, the first few principal components), but typically use unreduced data, with full dimensionality. In all cases the methods allow the quantitative description of the multivariate boundaries that characterise and separate each class, for example, each class of xenobiotic in terms of its metabolic effects. It is also possible to obtain confidence limits on any predictions, for example, a level of probability to be placed on the goodness of fit (see, for example, Sharaf, 1986). The robustness of the predictive models can also be checked using cross-validation, by leaving out selected samples from the analysis.
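By way of illustration only, the training, testing and cross-validation steps may be sketched as follows. A nearest-centroid classifier is used here purely as a simple stand-in for the statistical model; it is not one of the methods named in this specification, and the function names, class labels and data values are hypothetical.

```python
import math


def fit_centroids(X, y):
    """Training: compute the mean descriptor vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for k, value in enumerate(row):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Assign a sample to the class whose centroid is nearest."""
    return min(centroids, key=lambda label: math.dist(centroids[label], row))


def leave_one_out_accuracy(X, y):
    """Cross-validation: refit without each sample in turn, then predict it."""
    hits = 0
    for i in range(len(X)):
        model = fit_centroids(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        hits += predict(model, X[i]) == y[i]
    return hits / len(X)


# Hypothetical training set: two descriptors per sample, two classes.
train_X = [[0.1, 0.2], [0.0, 0.3], [0.2, 0.1], [1.0, 1.1], [0.9, 1.2], [1.1, 0.9]]
train_y = ["control", "control", "control", "treated", "treated", "treated"]
model = fit_centroids(train_X, train_y)

# Independent test set, not used to build the model.
print(predict(model, [0.05, 0.25]), predict(model, [1.0, 1.0]))
print(leave_one_out_accuracy(train_X, train_y))
```

Any of the supervised methods listed in this specification could take the place of the centroid model; the split into training data, independent test data and left-out samples is the point being illustrated.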

[0041] Expert systems may operate to generate a variety of useful outputs, for example, (i) classification of the sample as “normal” or “abnormal” (this is a useful tool in the control of spectrometer automation using sequential flow injection NMR spectroscopy); (ii) classification of the target organ for toxicity and site of action within the tissue where in certain cases, mechanism of toxic action may also be classified; and, (iii) identification of the biomarkers of a pathological disease condition or toxic effect for the particular compound under study. For example, a sample can be classified as belonging to a single class of toxicity, to multiple classes of toxicity (more than one target organ), or to no class. The latter case would indicate deviation from normality (control) based on the training set model but having a dissimilar metabolic effect to any toxicity class modelled in the training set (unknown toxicity type). Under (ii), a system could also be generated to support decisions in clinical medicine (e.g., for efficacy of drugs) rather than toxicity.

[0042] Examples of supervised pattern recognition methods include the following, which are briefly described below: soft independent modelling of class analogy (SIMCA) (see, for example, Wold, 1976); partial least squares analysis (PLS) (see, for example, Wold, 1966; Joreskog, 1982; Frank, 1984); linear discriminant analysis (LDA) (see, for example, Nillson, 1965); K-nearest neighbour analysis (KNN) (see, for example, Brown et al., 1996); artificial neural networks (ANN) (see, for example, Wasserman, 1989; Anker et al., 1992; Hare, 1994); probabilistic neural networks (PNNs) (see, for example, Parzen, 1962; Bishop, 1995; Speckt, 1990; Broomhead et al., 1988; Patterson, 1996); rule induction (RI) (see, for example, Quinlan, 1986); and, Bayesian methods (see, for example, Bretthorst, 1990).

[0043] As the size of metabonomic databases increases together with improvements in rapid throughput of NMR samples (>300 samples per day per spectrometer is now possible with the first generation of flow injection systems), more subtle expert systems may be necessary, for example, using techniques such as “fuzzy logic” which permit greater flexibility in decision boundaries.

[0044] Pattern recognition methods have been applied to the analysis of metabonomic data, including, for example, complex NMR data, with some success (see, for example, Anthony et al., 1994; Anthony et al., 1995; Beckwith-Hall et al., 1998; Gartland et al., 1990a; Gartland et al., 1990b; Gartland et al., 1991; Holmes et al., 1998a; Holmes et al., 1998b; Holmes et al., 1992; Holmes et al., 1994; Spraul et al., 1994; Tranter et al., 1999).

[0045] Although the utility of the metabonomic approach is well established, there remains a great need for improved methods of analysis. The metabolic variation is often subtle, and powerful analysis methods are required for detection of particular analytes, especially when the data (e.g., NMR spectra) are so complex.

[0046] One aim of the present invention is to provide data analysis methods for the detection of such metabolic variations, as part of a metabonomic approach.

SUMMARY OF THE INVENTION

[0047] One aspect of the present invention pertains to improved methods for the analysis of chemical, biochemical, and biological data, for example, spectra, for example, nuclear magnetic resonance (NMR) and other types of spectra.

[0048] One aspect of the invention pertains to a method for processing a sample spectrum comprising:

[0049] replacing each of one or more target regions in said sample spectrum with a corresponding replacement region of a master control spectrum to give a target-replaced sample spectrum,

[0050] wherein said replacement region has been scaled so as to have the same fraction of the total integrated intensity in said target-replaced sample spectrum as it did in said master control spectrum.

[0051] One embodiment of the present invention pertains to a method for processing a sample spectrum for a test sample, said method comprising the steps of:

[0052] (a) identifying, in said sample spectrum, one or more target regions for replacement;

[0053] (b) providing a master control spectrum which comprises one replacement region corresponding to each of said target regions; and,

[0054] (c) replacing each of said target regions with the corresponding replacement region to give a target-replaced sample spectrum,

[0055] wherein said replacement region has been scaled so as to have the same fraction of the total integrated intensity in said target-replaced sample spectrum as it did in said master control spectrum.

[0056] In one embodiment of the present invention, the method furthercomprises the subsequent step of:

[0057] (d) normalising said target-replaced sample spectrum to give anormalised target-replaced sample spectrum.

[0058] One embodiment of the present invention pertains to a method for processing a sample NMR spectrum for a test sample, said method comprising the steps of:

[0059] (a) identifying, in said sample NMR spectrum, one or more target regions for replacement, wherein each of said target regions is defined by a chemical shift range;

[0060] (b) providing a master control NMR spectrum which comprises one replacement region corresponding to each of said target regions, wherein a target region and its corresponding replacement region are defined by the same chemical shift range; and,

[0061] (c) replacing each of said target regions with the corresponding replacement region to give a target-replaced sample NMR spectrum,

[0062] wherein said replacement region has been scaled so as to have the same fraction of the total integrated intensity in said target-replaced sample NMR spectrum as it did in said master control NMR spectrum.

[0063] In one embodiment of the present invention, the method further comprises the subsequent step of:

[0064] (d) normalising said target-replaced sample NMR spectrum to give a normalised target-replaced sample NMR spectrum.

[0065] In one embodiment of the present invention, in said replacing step (c), each of said target regions is replaced with the corresponding replacement region to give a target-replaced sample spectrum,

[0066] wherein said replacement region has been scaled by a factor, f, given by the formula: $f = \frac{I_{Y} - \sum_{k} I_{Y,Tk}}{I_{CM} - \sum_{k} I_{CM,Rk}}$

[0067] wherein:

[0068] I_(Y) is the total integrated intensity of the sample spectrum;

[0069] I_(Y,Tk) is the integrated intensity of the target region;

[0070] I_(CM) is the total integrated intensity of the master control spectrum;

[0071] I_(CM,Rk) is the integrated intensity of the replacement region;

[0072] k ranges from 1 to n_(t); and,

[0073] n_(t) is the number of target regions.

[0074] Another aspect of the invention pertains to a sample spectrum which has been processed by a method according to the present invention.

[0075] Another aspect of the invention pertains to a method for processing a plurality of sample spectra, comprising processing each of said sample spectra by a method according to the present invention.

[0076] Another aspect of the invention pertains to a method of analysis of an applied stimulus, comprising the steps of:

[0077] (a) providing one or more sample spectra for each of one or more samples from each of one or more organisms which have been subjected to said applied stimulus;

[0078] (b) providing a master control spectrum derived from one or more control spectra for each of one or more samples from each of one or more organisms which have not been subjected to said applied stimulus;

[0079] (c) processing each of said sample spectra using a method according to the present invention.

[0080] In one preferred embodiment, the applied stimulus is a xenobiotic. In one preferred embodiment, the applied stimulus is a disease state. In one preferred embodiment, the applied stimulus is a genetic modification.

[0081] Another aspect of the invention pertains to a method for identifying a biomarker or biomarker combination for an applied stimulus, comprising a method of analysis of an applied stimulus as described herein.

[0082] Another aspect of the invention pertains to a biomarker or biomarker combination identified by such a method.

[0083] Another aspect of the invention pertains to a method of diagnosis of an applied stimulus employing a biomarker identified by such a method.

[0084] Another aspect of the invention pertains to an assay, which employs a biomarker identified by a method as described herein.

[0085] Another aspect of the invention pertains to a method of classifying an applied stimulus, comprising a method of analysis of an applied stimulus as described herein.

[0086] Another aspect of the invention pertains to a method of diagnosis of an applied stimulus, comprising a method of analysis of an applied stimulus as described herein.

[0087] Another aspect of the invention pertains to a method of therapeutic monitoring of a subject undergoing therapy, comprising a method of analysis of an applied stimulus as described herein.

[0088] Another aspect of the invention pertains to a method of evaluating drug therapy and/or drug efficacy, comprising a method of analysis of an applied stimulus as described herein.

[0089] Another aspect of the invention pertains to a method of detecting toxic side-effects of a drug, comprising a method of analysis of an applied stimulus as described herein.

[0090] Another aspect of the invention pertains to a method of characterising and/or identifying a drug in overdose, comprising a method of analysis of an applied stimulus as described herein.

[0091] In one preferred embodiment, the spectrum or spectra is an NMR spectrum or NMR spectra.

[0092] Another aspect of the invention pertains to a computer system operatively configured to implement a method according to the present invention.

[0093] Another aspect of the invention pertains to computer code suitable for implementing a method according to the present invention.

[0094] Another aspect of the invention pertains to a data carrier which carries computer code suitable for implementing a method according to the present invention on a suitable computer system.

[0095] As will be appreciated by one of skill in the art, features and preferred embodiments of one aspect of the invention will also pertain to other aspects of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

[0096] FIG. 1 is a graph showing the four base spectra, denoted A, B, C, and D, which were used to generate the simulated data in the Examples.

[0097] FIG. 2 is a graph showing the four animal factors, denoted AF_(A), AF_(B), AF_(C), and AF_(D), which were used to generate the simulated data in the Examples.

[0098] FIG. 3 is a graph showing the four time factors, denoted TF_(A), TF_(B), TF_(C), and TF_(D), which were used to generate the simulated data in the Examples.

[0099] FIG. 4 is a graph showing spectra for animal number 6 (A₆) at the five time points (T₁-T₅), denoted (i), (ii), (iii), (iv), and (v), respectively, as well as the master control spectrum.

[0100] FIG. 5 is a graph showing, for animal number 6 at time point 2 (A₆, T₂), (i) the original spectrum, before replacement; (ii) the spectrum after spectral replacement; and (iii) spectrum (ii) after re-normalisation.

[0101] FIG. 6 is a graph showing, for animal number 6 at time point 3 (A₆, T₃), (i) the original spectrum, before replacement; (ii) the spectrum after spectral replacement;

[0102] and (iii) spectrum (ii) after re-normalisation.

[0103] FIG. 7 is a graph showing a scores plot (principal component 1 versus principal component 2) following principal component analysis of the sample spectra, wherein the spectral regions associated with the interfering signal were deleted from all spectra.

[0104] FIG. 8 is a graph showing a scores plot (principal component 1 versus principal component 2) following principal component analysis of the normalised target-replaced spectra, wherein the replaced regions were treated as “missing data.”

[0105] FIG. 9 is a graph showing a scores plot (principal component 1 versus principal component 2) following principal component analysis of the normalised target-replaced spectra, wherein the replaced regions were not treated as “missing data.”

DETAILED DESCRIPTION OF THE INVENTION

[0106] The present invention pertains generally to the field of chemometrics, metabonomics, and, more particularly, to methods for the analysis of biological data, particularly spectra.

[0107] Biological Data

[0108] The methods of the present invention are applicable to chemical, biochemical, and biological data, for example, spectra, and especially spectra generated using types of spectroscopy and spectrometry which are useful in chemical and biochemical (i.e., molecular) studies.

[0109] The methods described herein facilitate more powerful analysis of spectral data. For example, the methods of the present invention make possible the identification of spectral changes associated with an event of interest from a spectral background which is non-specific and/or irrelevant.

[0110] In the context of studies of organisms, the event of interest may be, for example, an applied stimulus. The term “applied stimulus,” as used herein, pertains to a stimulus under study which is applied to, or is present in, an organism(s) under study, and is not applied to, and is absent in, a control organism(s). Examples of applied stimuli include, but are not limited to, a xenobiotic, a disease state, and a genetic modification.

[0111] The term “xenobiotic,” as used herein, pertains to a substance (e.g., compound, composition) which is administered to an organism, or to which the organism is exposed. In general, xenobiotics are chemical, biochemical or biological molecules which are not normally present in that organism, or are normally present in that organism, but not at the level obtained following administration. Examples of xenobiotics include drugs, formulated medicines and their components, pesticides, herbicides, substances present in foods (e.g., plant compounds administered to animals), and substances present in the environment.

[0112] The term “disease state,” as used herein, pertains to a deviation from the normal healthy state of the organism. Examples of disease states include bacterial, viral, and parasitic infections, cancer in all its forms, degenerative diseases (e.g., arthritis, multiple sclerosis), trauma (e.g., as a result of injury), organ failure (including diabetes), cardiovascular disease (e.g., atherosclerosis, thrombosis), and inherited diseases caused by genetic composition (e.g., sickle-cell anaemia).

[0113] The term “genetic modification,” as used herein, pertains to alteration of the genetic composition of an organism. Examples of genetic modifications include the incorporation of a gene or genes into an organism from another species, increasing the number of copies of an existing gene or genes in an organism, removal of a gene or genes from an organism, and rendering a gene or genes in an organism non-functional.

[0114] Examples of the types of spectroscopy which give spectra suitable for the application of the methods of the present invention include, but are not limited to, the following: all regions of the electromagnetic spectrum, including, for example, microwave spectroscopy; far infrared spectroscopy; infrared spectroscopy; Raman and resonance Raman spectroscopy; visible spectroscopy; ultraviolet spectroscopy; far ultraviolet (or vacuum ultraviolet) spectroscopy; x-ray spectroscopy; optical rotatory dispersion, circular dichroism (e.g., ultraviolet, visible and infrared); Mössbauer spectroscopy; atomic absorption and emission spectroscopy; ultraviolet fluorescence and phosphorescence spectroscopy; magnetic resonance, including nuclear magnetic resonance (NMR), electron paramagnetic resonance (EPR), MRI (magnetic resonance imaging); and mass spectrometry, including variations of ionization methods, including electron impact, chemical ionisation, thermospray, electrospray, matrix assisted laser desorption ionization (MALDI), inductively coupled plasma, and detection methods, including sector detection, quadrupole detection, ion-trap, time-of-flight, and Fourier transform.

[0115] One particularly preferred class of spectroscopy is nuclear magnetic resonance (NMR). Examples of such methods include 1D, 2D, and 3D-NMR, including, for example, 1D spectra, such as single pulse, water-peak saturated, spin-echo such as CPMG (i.e., edited on the basis of nuclear spin relaxation times), diffusion-edited; 2D spectra, such as J-resolved (JRES), ¹H-¹H correlation methods such as NOESY, COSY, TOCSY and variants thereof, methods which correlate ¹H to heteronuclei (including, for example, ¹³C, ¹⁵N, ¹⁹F, and ³¹P), such as direct detection methods such as HETCOR and inverse-detected methods such as ¹H-¹³C HMQC, HSQC and HMBC; 3D spectra, including many variants, which are combinations of 2D methods, e.g., HMQC-TOCSY, NOESY-TOCSY, etc. All of these NMR spectroscopic techniques can also be combined with magic-angle-spinning (MAS) in order to study samples other than isotropic liquids, such as tissues or foodstuffs, which are characterised by anisotropic composition.

[0116] Composite spectra, which are formed from two or more spectra of different types, may also be used.

[0117] The methods of the present invention are applied to spectra obtained or recorded for particular samples under study. Samples may be in any form which is compatible with the particular type of spectroscopy, and therefore may be, as appropriate, homogeneous or heterogeneous, comprising one or a combination of, a gas, a liquid, a liquid crystal, a gel, or a solid, and including samples with a biological origin.

[0118] Examples of such samples include those originating from an organism, for example, a whole organism (living or dead, e.g., a living human, a culture of bacteria); a part or parts of an organism (e.g., a tissue sample, an organ, a leaf); a pathological tissue such as a tumour; a tissue homogenate (e.g., a liver microsome fraction); an extract prepared from an organism or a part of an organism (e.g., a tissue sample extract, such as perchloric acid extract); an infusion prepared from an organism or a part of an organism (e.g., tea, Chinese traditional herbal medicines); an in vitro tissue such as a spheroid; a suspension of a particular cell type (e.g., hepatocytes); an excretion, secretion, or emission from an organism (especially a fluid); material which is administered and collected (e.g., dialysis fluid, lung aspirate fluid); material which develops as a function of pathology (e.g., a cyst, blisters); and supernatant from a cell culture.

[0119] Examples of fluid samples include, for example, urine, (gall bladder) bile, blood plasma, whole blood, cerebrospinal fluid, milk, saliva, mucus, sweat, gastric juice, pancreatic juice, seminal fluid, prostatic fluid, seminal vesicle fluid, seminal plasma, amniotic fluid, foetal fluid, follicular fluid, synovial fluid, aqueous humour, ascites fluid, cystic fluid, and blister fluid, plus cell suspensions and extracts thereof.

[0120] Examples of tissue samples include liver, kidney, prostate, brain, gut, blood, skeletal muscle, heart muscle, lymphoid, bone, cartilage, and reproductive tissues.

[0121] Still other examples of samples include air (e.g., exhaust), air condensates or extracts, water (e.g., seawater, groundwater, wastewater, e.g., from factories), liquids from the food industry (e.g., juices, wines, beers, other alcoholic drinks, tea, milk), solid-like food samples (e.g., chocolate, pastes, fruit peel, fruit and vegetable flesh such as banana, leaves, meats, whether cooked or raw, etc.).

[0122] The sample may also be a concentrate of a fluid, for example, a concentrate of a fluid described above.

[0123] For samples which are, or are drawn from, an organism, the organism, in general, may be a prokaryote (e.g., bacteria) or a eukaryote (e.g., protoctista, fungi, plants, animals).

[0124] The organism may be an alga or a protozoan.

[0125] The organism may be a plant, an angiosperm, a dicotyledon, a monocotyledon, a gymnosperm, a conifer, a ginkgo, a cycad, a fern, a horsetail, a clubmoss, a liverwort, or a moss.

[0126] The organism may be a chordate, an invertebrate, an echinoderm (e.g., starfish, sea urchins, brittlestars), an arthropod, an annelid (segmented worms) (e.g., earthworms, lugworms, leeches), a mollusk (cephalopods (e.g., squids, octopi), pelecypods (e.g., oysters, mussels, clams), gastropods (e.g., snails, slugs)), a nematode (round worms), a platyhelminthes (flatworms) (e.g., planarians, flukes, tapeworms), a cnidaria (e.g., jelly fish, sea anemones, corals), or a porifera (e.g., sponges).

[0127] The organism may be an arthropod, an insect (e.g., beetles, butterflies, moths), a chilopoda (centipedes), a diplopoda (millipedes), a crustacean (e.g., shrimps, crabs, lobsters), or an arachnid (e.g., spiders, scorpions, mites).

[0128] The organism may be a chordate, a vertebrate, a mammal, a bird, a reptile (e.g., snakes, lizards, crocodiles), an amphibian (e.g., frogs, toads), a bony fish (e.g., salmon, plaice, eel, lungfish), a cartilaginous fish (e.g., sharks, rays), or a jawless fish (e.g., lampreys, hagfish).

[0129] The organism may be a mammal, a placental mammal, a marsupial (e.g., kangaroo, wombat), a monotreme (e.g., duckbilled platypus), a rodent (e.g., a guinea pig, a hamster, a rat, a mouse), murine (e.g., a mouse), avian (e.g., a bird), canine (e.g., a dog), feline (e.g., a cat), equine (e.g., a horse), porcine (e.g., a pig), ovine (e.g., a sheep), bovine (e.g., a cow), a primate, simian (e.g., a monkey or ape), a monkey (e.g., marmoset, baboon), an ape (e.g., gorilla, chimpanzee, orangutan, gibbon), or a human.

[0130] Furthermore, the organism may be in any of its forms, for example, a spore, a seed, an egg, a larva, a pupa, or a foetus.

[0131] Spectral Replacement

[0132] Spectra often have features (e.g., peaks, noise spikes, baseline artefacts, etc.) which interfere with and/or reduce the power and/or accuracy of subsequent analysis. Some of these features are artefacts of the particular type of spectrum, its method of acquisition, adventitious impurities, and the like. However, more often these spectral features arise from chemical species which are not accidentally or unintentionally present in the sample under study. In order to improve the power and efficiency of subsequent spectral analysis, it is useful to identify and treat appropriately those parts of the spectra which are associated with such species. In addition, spectral features introduced unintentionally need to be identified and treated appropriately.

[0133] For example, in metabonomic studies, a sample from an organism under study may show spectral evidence of a large number of metabolites, some of which provide little or no useful information about the applied stimulus, yet interfere with subsequent data analysis. For example, spectral peaks from drugs and their metabolites often dominate the metabonomic description of the dosed organism, but their identification and levels are sometimes of secondary importance.

[0134] In general, metabolites may be placed in one of three classes:

[0135] (A) Endogenous metabolites, the levels of which are significantly altered by the application of the applied stimulus. A single metabolite of this type is typically referred to as a biomarker. In a more complex case, where the levels of several, or more, metabolites are changed (whether increased or decreased), the group of metabolites is typically referred to as a biomarker combination. For example, an increase in taurine together with creatine levels in urine is a general marker for liver damage. In a more complex example, toxins which cause lesions in the S3 portion of the renal proximal tubule cause elevations of urinary glucose, amino acids and organic acids with decreases in tricarboxylic acid cycle intermediates.

[0136] (B) Endogenous metabolites, the levels of which are unaffected by application of the applied stimulus.

[0137] (C) Metabolites, which appear in the sample and which arise from a xenobiotic itself or its metabolites. For example, paracetamol is seen in urine mainly as paracetamol sulfate and paracetamol glucuronide conjugates. In some cases unchanged paracetamol can also be seen. Of course, these metabolites will be present only if the applied stimulus includes a xenobiotic.

[0138] Metabolites falling in class B, and many of those metabolites falling in class C, i.e., not biomarkers or biomarker combinations, collectively referred to herein as “interfering signals”, often provide little information about the organism's response to an applied stimulus, while dominating and interfering with the metabonomic description of the stimulated organism.

[0139] Whether or not a particular metabolite is, or is a candidate as, an interfering signal can often be determined from known data regarding the applied stimulus under study. For example, there may be a large body of public knowledge regarding the metabolism of a particular compound, or of compounds having a particular substructure. Often, an interfering signal, and its associated spectral features, can be readily identified by eye by the skilled artisan. However, if new spectral features are observed which are not readily identified, the associated compounds giving rise to these features can be isolated and characterised using known methods, for example, by coupling liquid chromatography with NMR or mass spectrometry.

[0140] In some methods, those parts of the spectrum associated with these interfering signals are excised. However, when comparing or combining data from several studies (e.g., using different xenobiotics, different disease states, etc.), these parts of the spectrum are effectively deleted from all spectra in a combined data set. The deleted regions can encompass a large fraction of the total spectral region, significantly reducing the information content of the combined set of spectra, and thereby reducing the power and efficiency of subsequently applied pattern recognition methods.

[0141] In some known methods, the excised parts of the spectrum are “filled,” for example, by replacing the excised spectral data with, for example, zero intensity values (“zero fill”); with an arbitrary or predetermined constant intensity value (“constant fill”); a random intensity value (“random fill”); a mean intensity value (“mean fill”) calculated from the entire dataset; or an intensity value based on a principal component analysis (“principal component fill”).
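For comparison with the spectral replacement method described below, three of these known fill strategies can be sketched as follows. This is an illustrative sketch only: the function name `fill_region`, the half-open index convention, and the reading of “mean fill” as a per-data-point mean across the dataset are assumptions made for the example, not details given in the text. Random fill and principal component fill are omitted.

```python
def fill_region(spectrum, start, stop, method="zero", constant=0.0, dataset=None):
    """Return a copy of `spectrum` with data points start..stop-1 overwritten.

    `spectrum` is a list of intensities; `dataset` (a list of spectra of the
    same length) is needed only for the "mean" method. All names here are
    hypothetical and chosen for illustration.
    """
    filled = list(spectrum)
    for j in range(start, stop):
        if method == "zero":
            filled[j] = 0.0                      # "zero fill"
        elif method == "constant":
            filled[j] = constant                 # "constant fill"
        elif method == "mean":
            # Assumed reading of "mean fill": mean intensity at data point j
            # taken over every spectrum in the dataset.
            filled[j] = sum(s[j] for s in dataset) / len(dataset)
        else:
            raise ValueError("unknown fill method: " + method)
    return filled


spectrum = [1.0, 2.0, 3.0, 4.0]
dataset = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]
zero_filled = fill_region(spectrum, 1, 3)
mean_filled = fill_region(spectrum, 1, 3, method="mean", dataset=dataset)
```

None of these fills preserves the relative contribution that the region makes to the total integrated intensity of a control spectrum, which is the property the scaled replacement is designed to retain.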

[0142] However, rather than simply deleting, or deleting and subsequently filling, these spectral regions, it is desirable to employ a method of “spectral replacement” in which these spectral regions are replaced with meaningful data, for example, corresponding scaled spectral regions from normal or control spectra (e.g., in the case of organism studies, spectra associated with normal or control organisms). Subsequent normalisation may further improve the data content, by scaling the peak intensities to values which, in a sense, they would have had if the interfering features (e.g., peaks) had not been included.

[0143] Therefore, whether for the metabonomic reasons discussed above, or for other reasons, the spectrum is subjected to the additional step of “spectral replacement” as described herein. In general, spectral replacement is performed following acquisition of the spectrum (or spectra), including the normal pre-processing associated with the particular type of spectrum (e.g., signal averaging, Fourier transformation, baseline correction, etc.), but before subsequent analysis.

[0144] One aspect of the present invention pertains to a method for processing a sample spectrum comprising replacing each of one or more target regions in said sample spectrum with the corresponding replacement region of a master control spectrum to give a target-replaced sample spectrum, wherein the replacement region has been scaled so as to have the same fraction of the total integrated intensity in said target-replaced sample spectrum as it did in said master control spectrum.

[0145] Thus, one embodiment of the present invention pertains to a method for processing a sample spectrum for a test sample, said method comprising the steps of:

[0146] (a) identifying, in said sample spectrum, one or more target regions for replacement;

[0147] (b) providing a master control spectrum which comprises one replacement region corresponding to each of said target regions; and,

[0148] (c) replacing each of said target regions with the corresponding replacement region to give a target-replaced sample spectrum, wherein said replacement region has been scaled so as to have the same fraction of the total integrated intensity in the target-replaced sample spectrum as it did in the master control spectrum.

[0149] In a preferred embodiment, the methods further comprise a subsequent step of:

[0150] (d) normalising said target-replaced sample spectrum to give a normalised target-replaced sample spectrum.

[0151] Another embodiment of the present invention pertains to a method for processing a sample NMR spectrum for a test sample, said method comprising the steps of:

[0152] (a) identifying, in said sample NMR spectrum, one or more target regions for replacement, wherein each of said target regions is defined by a chemical shift range;

[0153] (b) providing a master control NMR spectrum which comprises one replacement region corresponding to each of said target regions, wherein a target region and its corresponding replacement region are defined by the same chemical shift range; and,

[0154] (c) replacing each of said target regions with the corresponding replacement region to give a target-replaced sample NMR spectrum, wherein said replacement region has been scaled so as to have the same fraction of the total integrated intensity in said target-replaced sample NMR spectrum as it did in said master control NMR spectrum.

[0155] In a preferred embodiment, the methods further comprise a subsequent step of:

[0156] (d) normalising said target-replaced sample NMR spectrum to give a normalised target-replaced sample NMR spectrum.

[0157] Note that, in each of the above methods, step (b) may be performed either before or after step (a).
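Steps (a) to (d) can be sketched in a few lines of code, using the scale factor f given in paragraph [0066]. The sketch below is illustrative rather than a definitive implementation: the function name `spectral_replacement`, the representation of target regions as half-open index ranges, and normalisation to unit total intensity are all assumptions made for the example. It also assumes that the master control spectrum has non-zero intensity outside the replacement regions, so that f is defined.

```python
def spectral_replacement(sample, master, target_regions, normalise=True):
    """Replace target regions of `sample` with scaled regions of `master`.

    `sample` and `master` are equal-length lists of intensities;
    `target_regions` is a list of (start, stop) half-open index ranges
    (step (a)). Names and conventions are hypothetical.
    Returns the processed spectrum and the scale factor f.
    """
    targets = {j for start, stop in target_regions for j in range(start, stop)}

    i_y = sum(sample)                          # I_Y: total intensity of sample
    i_cm = sum(master)                         # I_CM: total intensity of master
    i_t = sum(sample[j] for j in targets)      # sum over k of I_Y,Tk
    i_r = sum(master[j] for j in targets)      # sum over k of I_CM,Rk

    # Scale factor from paragraph [0066].
    f = (i_y - i_t) / (i_cm - i_r)

    # Step (c): copy scaled master values into the target regions.
    replaced = [f * master[j] if j in targets else sample[j]
                for j in range(len(sample))]

    # Step (d), assumed here to mean scaling to unit total intensity.
    if normalise:
        total = sum(replaced)
        replaced = [v / total for v in replaced]
    return replaced, f


master = [1.0, 1.0, 2.0, 1.0, 1.0]      # region at index 2 holds 2/6 of total
sample = [2.0, 2.0, 20.0, 2.0, 2.0]     # interfering peak at index 2
processed, f = spectral_replacement(sample, master, [(2, 3)])
```

In this toy case f is 2, and the replaced data point accounts for one third of the total intensity of the processed spectrum, the same fraction as in the master control spectrum.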

[0158] The term “sample spectrum,” as used herein, pertains to a spectrum obtained from a sample under study. If there are several sample spectra, as is typically the case, each one is treated separately.

[0159] The sample spectrum is, to one degree or another, representative of the composition of the sample. In general, a sample can be generalised as an n-dimensional object, where the coordinate along each of the axes or dimensions is the concentration of individual chemical or biochemical species. Equivalently, the sample can be represented via its spectrum, also as an n-dimensional object, y, where the coordinate along each of the axes or dimensions (y₁, y₂, y₃, . . . y_(j)) is the spectral intensity (or equivalent parameter) at each data point. For example, for a 1D NMR spectrum, each of y₁, y₂, y₃, etc. may represent signal intensity at different chemical shifts. It is not necessary to assign spectral features (e.g., peaks, features, lines) at this stage, since the spectrum is treated solely as a statistical object.

[0160] A sample spectra set, Y, may be formed from n_(y) sample spectra, each of which is denoted y_(i) (where i runs from 1 to n_(y)) and each of which has descriptors y_(ij) (where j ranges from 1 to the total number of descriptors). Each sample spectrum, i, has a total integrated intensity, I_(Yi), given by: $I_{Yi} = \sum_{j} y_{ij}$

[0161] As mentioned above, the target regions are one or more spectral regions in the sample spectrum which are to be replaced. Each of one or more target regions in the sample spectrum is replaced with the corresponding and appropriately scaled replacement region of a master control spectrum. The target regions for the ith sample spectrum may be denoted t_(i,k), where k ranges from 1 to n_(t), and n_(t) is the number of target regions. In metabonomic studies, the target regions typically pertain to, relate to, or otherwise reflect low correlation metabolites, as discussed above.

[0162] The master control spectrum may be a single spectrum, referred to herein as a “control spectrum,” or more preferably it is an average spectrum calculated from two or more control spectra. Where the spectra are associated with an organism, the single control spectrum is a spectrum from a control organism. The control spectra may be obtained from a single control organism, or, more preferably, from two or more control organisms. In the context of studies of organisms, the stimulus under study is not applied to, nor is present in, the control organism(s). For example, a control spectra set, C, may be formed from n_(c) control spectra, each of which is denoted c_(i) (where i runs from 1 to n_(c)) and has descriptors c_(ij), where j runs between 1 and the number of descriptors. The master control spectrum, c_(M), having descriptors c_(Mj), may be calculated as: $c_{Mj} = \frac{1}{n_{c}} \sum_{i} c_{ij}$

[0163] The master control spectrum has a total integrated intensity, I_(CM), given by: $I_{CM} = \sum_{j} c_{Mj}$
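The two formulas above amount to a point-by-point average followed by a sum. A minimal sketch, assuming each control spectrum is a list of intensities sampled at the same data points (the function names are hypothetical):

```python
def master_control_spectrum(control_spectra):
    """Average n_c control spectra point by point to give c_M."""
    n_c = len(control_spectra)
    n_points = len(control_spectra[0])
    # c_Mj = (1 / n_c) * sum over i of c_ij
    return [sum(c[j] for c in control_spectra) / n_c for j in range(n_points)]


def total_integrated_intensity(spectrum):
    """Sum the descriptors of a spectrum (I_Yi for a sample, I_CM for c_M)."""
    return sum(spectrum)


controls = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
c_m = master_control_spectrum(controls)
i_cm = total_integrated_intensity(c_m)
```

The same summation gives I_(Yi) when applied to a sample spectrum.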

[0164] The master control NMR spectrum comprises one replacement region corresponding to each of the target regions. The term “replacement region(s),” as used herein, pertains to that part/those parts of the master control spectrum which correspond(s) to the target region(s) of the sample spectrum. For example, if the spectrum is a 1D NMR spectrum, and a particular target region is defined as δ 7.2-7.7 in the sample NMR spectrum, then the corresponding replacement region is also defined as δ 7.2-7.7 in the master control NMR spectrum.
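Because a target region and its replacement region share one chemical shift range, a single set of data-point indices can serve both spectra when they are sampled on the same chemical shift axis. The sketch below assumes such a shared axis and inclusive range bounds; the function name `region_indices` and the toy axis are hypothetical.

```python
def region_indices(ppm_axis, low, high):
    """Return indices of data points whose chemical shift lies in [low, high].

    `ppm_axis` lists the chemical shift (in ppm) of each data point and is
    assumed to be shared by the sample and master control spectra.
    """
    return [j for j, delta in enumerate(ppm_axis) if low <= delta <= high]


# Toy axis running from 10.0 ppm down to 0.0 ppm in steps of 0.5 ppm.
ppm_axis = [10.0 - 0.5 * j for j in range(21)]
target = region_indices(ppm_axis, 7.2, 7.7)
```

The resulting indices select the target region in the sample spectrum and the replacement region in the master control spectrum alike.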

[0165] Each replacement region is scaled so that it represents the same fraction of the total integrated intensity in the target-replaced sample spectrum as it did in the master control spectrum. For example, if a replacement region represented 2% of the total intensity in the master control spectrum, then it must also account for 2% of the total intensity in the target-replaced sample spectrum.

[0166] For example, consider the case where the sample spectrum, with integrated intensity I_(Y), has a single target region with integrated intensity I_(T). The remainder of the spectrum has an integrated intensity of I_(Y)−I_(T). The master control spectrum has an integrated intensity of I_(CM), and the replacement region therein has an integrated intensity of I_(R). The fraction of the total integrated intensity in the master control spectrum accounted for by the replacement region is I_(R)/I_(CM). The replacement region is scaled by a factor, f, and thus the scaled replacement region has an integrated intensity of fI_(R). The target-replaced spectrum now has an integrated intensity of I_(Y)−I_(T)+fI_(R).

[0167] The scale factor, f, is selected so that the scaled replacement region (intensity fI_(R)) has the same fraction of the total integrated intensity in the target-replaced sample spectrum (fI_(R)/[fI_(R)+I_(Y)−I_(T)]) as it did in the master control spectrum (I_(R)/I_(CM)), that is, fI_(R)/[fI_(R)+I_(Y)−I_(T)]=I_(R)/I_(CM). Rearranging this equation gives: $f = \frac{I_{Y} - I_{T}}{I_{CM} - I_{R}}$
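The rearrangement can be written out step by step. This is a sketch of the algebra only, and it assumes that I_(R) is non-zero and that I_(CM) differs from I_(R), so that both divisions are defined; neither condition is stated explicitly in the text.

```latex
\begin{align*}
\frac{f I_{R}}{f I_{R} + I_{Y} - I_{T}} &= \frac{I_{R}}{I_{CM}} \\
f I_{CM} &= f I_{R} + I_{Y} - I_{T} \\
f \left( I_{CM} - I_{R} \right) &= I_{Y} - I_{T} \\
f &= \frac{I_{Y} - I_{T}}{I_{CM} - I_{R}}
\end{align*}
```

The second line follows from cross-multiplying and dividing both sides by I_(R).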

[0168] Consider also the case where the sample spectrum, with integratedintensity I_(Y), has two target region with integrated intensitiesI_(T1) and I_(T2), respectively. The remainder of the spectrum has anintegrated intensity of I_(Y)−I_(T1)−I_(T2). The master control spectrumhas an integrated intensity of I_(CM), and the respective replacementregions therein have integrated intensities of I_(R1) and I_(R2),respectively. The fraction of the total integrated intensity in themaster control spectrum accounted for by the first and secondreplacement regions is I_(R1)/I_(CM) and I_(R2)/I_(CM), respectively.The first replacement region is scaled by a factor, f₁, and thus thescaled first replacement region has an integrated intensity of f₁I_(R1).The second replacement region is scaled by a factor, f₂, and thus thescaled second replacement region has an integrated intensity off₂I_(R2). The target replaced spectrum now has an integrated intensityof I_(Y)−I_(T)+f₁I_(R1)+f₂I_(R2).

[0169] The scale factors, f₁ and f₂, are selected so that each scaled replacement region (intensities f₁I_(R1) and f₂I_(R2), respectively) has the same fraction of the total integrated intensity in the target-replaced sample spectrum (f₁I_(R1)/[I_(Y)−I_(T1)−I_(T2)+f₁I_(R1)+f₂I_(R2)] and f₂I_(R2)/[I_(Y)−I_(T1)−I_(T2)+f₁I_(R1)+f₂I_(R2)], respectively) as it did in the master control spectrum (I_(R1)/I_(CM) and I_(R2)/I_(CM), respectively). This gives two simultaneous equations: f₁I_(R1)/[I_(Y)−I_(T1)−I_(T2)+f₁I_(R1)+f₂I_(R2)]=I_(R1)/I_(CM) and f₂I_(R2)/[I_(Y)−I_(T1)−I_(T2)+f₁I_(R1)+f₂I_(R2)]=I_(R2)/I_(CM), from which it can be shown that: $f_{1} = f_{2} = f = \frac{I_{Y} - I_{T1} - I_{T2}}{I_{CM} - I_{R1} - I_{R2}}$
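The step from the two simultaneous equations to the common scale factor may be sketched as follows. Here S is a shorthand, introduced purely for illustration, for the total integrated intensity of the target-replaced spectrum, and I_(R1) and I_(R2) are assumed to be non-zero:

$S = I_{Y} - I_{T1} - I_{T2} + f_{1} I_{R1} + f_{2} I_{R2}$

$\frac{f_{1} I_{R1}}{S} = \frac{I_{R1}}{I_{CM}} \Rightarrow f_{1} = \frac{S}{I_{CM}}, \qquad \frac{f_{2} I_{R2}}{S} = \frac{I_{R2}}{I_{CM}} \Rightarrow f_{2} = \frac{S}{I_{CM}}$

$f I_{CM} = I_{Y} - I_{T1} - I_{T2} + f \left( I_{R1} + I_{R2} \right)$

Both equations give the same value, S/I_(CM), so f₁ = f₂ = f. Substituting this common factor back into the definition of S gives the third line, and collecting the terms in f gives the expression for the scale factor.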

[0170] In the general case, the target regions for the ith sample spectrum (Y_(i)) are denoted t_(i,k), the corresponding replacement regions are denoted r_(k), and in both cases, k ranges from 1 to n_(t), where n_(t) is the number of target regions.

[0171] For the kth target region of the ith sample spectrum, denoted t_(i,k), the integrated intensity, I_(Yi,Tk), is calculated as: $I_{Yi,Tk} = \sum\limits_{j} y_{ij}$

[0172] where the sum is over the descriptors, j, of that target region.

[0173] Similarly, for the replacement region of the master control spectrum, denoted r_(k) (corresponding to the kth target region of the ith sample spectrum, t_(i,k)), the integrated intensity is calculated as: $I_{CM,Rk} = \sum\limits_{j} c_{Mj}$

[0174] where the sum is over the descriptors, j, of that replacement region.

[0175] Thus, generalising the above examples, it may be shown that where there are many target regions, the scale factor for the ith sample spectrum is given by: $f_{i} = \frac{I_{Yi} - \sum\limits_{k} I_{Yi,Tk}}{I_{CM} - \sum\limits_{k} I_{CM,Rk}}$

[0176] wherein:

[0177] I_(Yi) is the total integrated intensity of the sample spectrum (before replacement);

[0178] I_(Yi,Tk) is the integrated intensity of the target region in question;

[0179] I_(CM) is the total integrated intensity of the master control spectrum;

[0180] I_(CM,Rk) is the integrated intensity of the replacement region in question; and k ranges from 1 to n_(t), the number of target regions.

[0181] Thus, prior to replacement of the target region, t_(i,k), by its corresponding replacement region, r_(k), that replacement region is scaled by (i.e., multiplied by) a factor, f_(i), given above. In this way, for each sample spectrum and for each target region therein, a target region, t_(i,k), of integrated intensity I_(Yi,Tk) is replaced by a replacement region, r_(k), of integrated intensity f_(i)I_(CM,Rk).
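The scaling and replacement steps may be sketched in code as follows. This is a minimal illustration only, not a definitive implementation: the function names, the representation of a spectrum as a list of bin intensities, the half-open (start, stop) convention for regions, and the numerical values are all assumptions made for the example. It further assumes that the sample and master control spectra share the same binning and that the target regions do not overlap.

```python
def scale_factor(sample, master, target_regions):
    """Return f_i = (I_Yi - sum_k I_Yi,Tk) / (I_CM - sum_k I_CM,Rk)."""
    # Every descriptor index covered by a target region (regions assumed disjoint).
    idx = [j for start, stop in target_regions for j in range(start, stop)]
    i_y = sum(sample)                    # I_Yi: total intensity of the sample spectrum
    i_cm = sum(master)                   # I_CM: total intensity of the master control
    i_t = sum(sample[j] for j in idx)    # summed intensity of the target regions
    i_r = sum(master[j] for j in idx)    # summed intensity of the replacement regions
    return (i_y - i_t) / (i_cm - i_r)


def replace_targets(sample, master, target_regions):
    """Return a copy of `sample` with each target region replaced by the
    corresponding region of `master`, multiplied by the scale factor."""
    f = scale_factor(sample, master, target_regions)
    replaced = list(sample)
    for start, stop in target_regions:
        for j in range(start, stop):
            replaced[j] = f * master[j]
    return replaced


# Hypothetical five-bin spectra; bin 1 of the sample carries an interfering signal.
master = [1.0, 2.0, 1.0, 4.0, 2.0]
sample = [2.0, 9.0, 2.0, 8.0, 4.0]
regions = [(1, 2)]  # one target region covering bin 1 only

replaced = replace_targets(sample, master, regions)
```

With these hypothetical values the scale factor is 2, and the replaced bin accounts for the same fraction of the total intensity (0.2) in the target-replaced spectrum as it does in the master control spectrum.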

[0182] The fourth step of the method, which is optional, but which is preferred, involves normalising the target-replaced sample spectrum to give a “normalised target-replaced sample spectrum.” Normalisation is typically achieved by scaling the target-replaced sample spectrum to give unit total integrated intensity, that is, by scaling by a factor of 1 divided by the total integrated intensity of the target-replaced sample spectrum, and thus may be expressed by the following formula: $y_{ij}^{R,N} = \frac{y_{ij}^{R}}{\sum\limits_{j} y_{ij}^{R}}$

[0183] wherein y_(ij) ^(R) denotes the descriptors of the target-replaced sample spectrum, and y_(ij) ^(R,N) denotes the descriptors of the normalised target-replaced sample spectrum.
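The normalisation formula may be illustrated by the following short sketch. The function name and the example values are hypothetical, and a spectrum is again assumed to be a plain list of descriptor values.

```python
def normalise(spectrum):
    """Scale a spectrum so that its total integrated intensity is 1."""
    total = sum(spectrum)
    if total == 0:
        raise ValueError("cannot normalise a spectrum with zero total intensity")
    return [y / total for y in spectrum]


# Hypothetical target-replaced spectrum with a total intensity of 20.
target_replaced = [2.0, 4.0, 2.0, 8.0, 4.0]
normalised = normalise(target_replaced)
```

Each descriptor is divided by the same total, so the relative proportions within the spectrum are unchanged.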

[0184] Once the spectra have been processed as described above, they may be subjected to further analysis as appropriate for the particular type of spectrum. A variety of known analysis methods may be employed, including, for example, those described in Press et al., 1983.

[0185] For example, for NMR spectra, conventional pattern recognition methods such as principal component analysis (PCA) may be applied. For example, it may be desirable to perform PCA using target-replaced spectra, or, more preferably, normalised target-replaced spectra. Similarly, it may or may not be desirable to treat the target-replaced regions as “missing data.”

[0186] Implementation

[0187] The methods of the present invention may be conveniently performed electronically, for example, using a suitably programmed computer system.

[0188] Thus, one aspect of the present invention pertains to a computer system or device, such as a computer or linked computers, operatively configured to implement the methods of the present invention.

[0189] Another aspect of the present invention pertains to computer code suitable for implementing the methods of the present invention on a suitable computer system.

[0190] In one embodiment, the present invention pertains to a computer program comprising computer program means adapted to perform a method according to the present invention when the program is run on a computer.

[0191] Another aspect of the present invention pertains to a data carrier which carries computer code suitable for implementing the methods of the present invention on a suitable computer.

[0192] In one embodiment, the present invention pertains to a computer program, as described above, embodied on a computer readable medium.

[0193] Examples of data carriers and computer readable media include chip media (e.g., ROM, RAM, flash memory (e.g., Memory Stick™, CompactFlash™, Smartmedia™)), magnetic disk media (e.g., floppy disks, hard drives), optical disk media (e.g., compact disks (CDs), digital versatile disks (DVDs), magneto-optical disks), and magnetic tape media.

[0194] Processing of NMR Spectra

[0195] Following data acquisition and initial pre-processing, but preceding the application of subsequent analysis (e.g., pattern recognition), the data is subjected to additional pre-processing, including a step of “spectral replacement” as described herein.

[0196] NMR spectra are typically acquired, and subsequently handled, in digitised form. Conventional methods of spectral pre-processing of (digital) spectra are well known, and include, where applicable, signal averaging, Fourier transformation (and other transformation methods), phase correction, baseline correction, smoothing, and the like (see, for example, Lindon et al., 1980).

[0197] Modern spectroscopic methods often permit the collection of high or very high resolution spectra. In digital form, even a simple spectrum (e.g., signal intensity versus some function of energy or frequency) may have many thousands, if not tens of thousands, of data points. It is often desirable to reduce or compress the data to give fewer data points, both for practical computing purposes and to effect some degree of signal averaging to compensate for physical effects, such as pH variation, compartmentalisation, and the like.

[0198] For example, a typical ¹H NMR spectrum is recorded as signal intensity versus frequency. NMR signals from ¹H nuclei have a characteristic position on this axis called a chemical shift. This is the frequency of observation relative to that of a reference signal. When this is divided by the observation frequency, this chemical shift is dimensionless, is given in parts per million (ppm) and is denoted by the symbol δ. For brevity this axis will be termed the chemical shift axis. For ¹H NMR spectra, this ranges from about δ 0 to δ 10. At a typical frequency resolution of about 10⁻⁴ to 10⁻³ ppm, the spectrum in digital form comprises about 10,000 to 100,000 data points (typically 2 to the power 16, that is, 64 k or 65,536).

[0199] As discussed above, it is often desirable to compress this data, for example, by a factor of about 10 to 100, to about 1000 descriptors.

[0200] For example, in one approach, the chemical shift axis, δ, is “segmented” into “buckets” or “bins” of a specific length. For a 1-D ¹H NMR spectrum which spans the range from δ 0 to δ 10, using a bucket length, Δδ, of 0.04 yields 250 buckets, for example, δ 10.0-9.96, δ 9.96-9.92, δ 9.92-9.88, etc. The signal intensity within a given bucket may be averaged or integrated, and the resulting value reported. In this way, a spectrum with, for example, 100,000 original data points can be compressed to an equivalent representation with, for example, 250 data points.
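The bucketing approach may be sketched as follows. This is an illustrative example only: the function name and arguments are hypothetical, the intensity within each bucket is integrated by simple summation (averaging would be an equally valid choice), and buckets are indexed upwards from δ 0 rather than downwards from δ 10.

```python
def bucket_spectrum(shifts, intensities, lo=0.0, hi=10.0, width=0.04):
    """Sum the intensities falling in each chemical-shift bucket.

    `shifts` and `intensities` are parallel sequences (delta in ppm, signal).
    Points outside [lo, hi) are ignored.
    """
    n_buckets = round((hi - lo) / width)  # 250 for the default values
    buckets = [0.0] * n_buckets
    for delta, value in zip(shifts, intensities):
        if lo <= delta < hi:
            # Clamp guards against floating-point rounding at the upper edge.
            k = min(int((delta - lo) / width), n_buckets - 1)
            buckets[k] += value
    return buckets


# Four hypothetical data points standing in for a high-resolution spectrum.
shifts = [0.01, 0.03, 0.05, 9.99]
intensities = [1.0, 2.0, 3.0, 4.0]
bucketed = bucket_spectrum(shifts, intensities)
```

Here the first two points fall in the bucket δ 0.00-0.04, the third in δ 0.04-0.08, and the last in δ 9.96-10.00.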

[0201] A similar approach can be applied to 2-D spectra, 3-D spectra, and the like. For 2-D spectra, the “bucket” approach may be extended to a “patch.” For 3-D spectra, the “bucket” approach may be extended to a “volume.” For example, a 2-D ¹H NMR spectrum which spans the range from δ 0 to δ 10 on both axes, using a patch of Δδ 0.1×Δδ 0.1 yields 10,000 patches. In this way, a spectrum with perhaps 10⁸ original data points can be compressed to an equivalent spectrum of 10⁴ data points.

[0202] Software for such processing of NMR spectra, for example AMIX (Analysis of MIXture, V 2.5, Bruker Analytik, Rheinstetten, Germany), is commercially available.

[0203] Often, certain spectral regions carry no real diagnostic information, or carry conflicting biochemical information, and it is often useful to remove these “redundant” regions before performing detailed analysis. In the simplest approach, the data points are deleted. In another simple approach, the data in the redundant regions are replaced with zero values.

[0204] For example, due to the dynamic range problem with water in comparison with other molecules, the water resonance (around δ 4.7) is suppressed. However, small variations in water suppression remain, and these variations can undesirably complicate analysis. Similarly, variations in water suppression may also affect the urea signal (around δ 5.5), by cross saturation. Therefore, it is often useful to delete certain spectral regions, for example, from about δ 4.5 to 6.0 (e.g., δ 4.52 to 6.00).

[0205] Certain metabolites exhibit a strong degree of physiological variation (e.g., diurnal variation, dietary-related variation) that is unrelated to any pathophysiological process. Such variation may undesirably complicate analysis, and mask more relevant details. Therefore, it may be useful to delete the spectral regions associated with such compounds. However, it is often possible to isolate these effects in later (e.g., pattern recognition) analysis.

[0206] Xenobiotics (e.g., drugs) and their metabolites often give rise to large signals which do not directly correlate to the conditions (e.g., pathologies) which are induced by the xenobiotic. Therefore, it is often useful to delete the spectral regions associated with such compounds.

[0207] In general, NMR data is handled as a data matrix. Typically, each row in the matrix corresponds to an individual sample (often referred to as a “data vector”), and the entries in the columns are, for example, spectral intensity of a particular data point, at a particular δ or Δδ (often referred to as “descriptors”).

[0208] It is often useful to pre-process data, for example, by addressing missing data, translation, scaling, and weighting.

[0209] If at all possible, missing data, for example, gaps in column values, should be avoided. However, if necessary, such missing data may be replaced or “filled” with, for example, the mean value of a column (“mean fill”); a random value (“random fill”); or a value based on a principal component analysis (“principal component fill”). Each of these different approaches will have a different effect on subsequent PR analysis.
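Of the filling approaches mentioned, “mean fill” is the simplest and may be sketched as follows. The function name is hypothetical, the data matrix is assumed to be a list of rows with one column per descriptor, and missing entries are assumed to be marked with None.

```python
def mean_fill(matrix):
    """Replace each missing entry (None) with the mean of the values
    that are present in the same column."""
    filled = [list(row) for row in matrix]
    for j in range(len(matrix[0])):
        present = [row[j] for row in matrix if row[j] is not None]
        if not present:
            raise ValueError(f"column {j} has no values to average")
        column_mean = sum(present) / len(present)
        for row in filled:
            if row[j] is None:
                row[j] = column_mean
    return filled


# Hypothetical data matrix: three samples (rows), two descriptors (columns).
data = [[1.0, None],
        [3.0, 4.0],
        [None, 6.0]]
filled = mean_fill(data)
```

The gap in the first column is filled with 2.0 and the gap in the second column with 5.0, these being the means of the values present in each column.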

[0210] “Translation” of the descriptor coordinate axes can be useful. Examples of such translation include normalisation and mean centring.

[0211] “Normalisation” may be used to remove sample-to-sample variation. Many normalisation approaches are possible, and they can often be applied at any of several points in the analysis. Usually, normalisation is applied after redundant spectral regions have been removed. In one approach, each spectrum is normalised (scaled) by a factor of 1/A, where A is the sum of the absolute values of all of the descriptors for that spectrum. In this way, each data vector has the same length, specifically, 1. For example, if the sum of the absolute values of intensities for each bucket in a particular spectrum is 1067, then the intensity for each bucket for this particular spectrum is scaled by 1/1067.

[0212] “Mean centring” may be used to simplify interpretation. Usually, for each descriptor, the average value of that descriptor for all samples is subtracted. In this way, the mean of a descriptor coincides with the origin, and all descriptors are “centred” at zero. For example, if the average intensity at δ 10.0-9.96, for all spectra, is 1.2 units, then the intensity at δ 10.0-9.96, for all spectra, is reduced by 1.2 units.
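Mean centring may be sketched as follows, again assuming (for illustration only) a data matrix held as a list of rows, one row per sample and one column per descriptor; the function name and values are hypothetical.

```python
def mean_centre(matrix):
    """Subtract from every entry the mean of its column (descriptor)."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    return [[row[j] - means[j] for j in range(n_cols)] for row in matrix]


# Hypothetical data: two samples, two descriptors with means 2.0 and 20.0.
data = [[1.0, 10.0],
        [3.0, 30.0]]
centred = mean_centre(data)
```

After centring, every column of the result has a mean of zero.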

[0213] In “unit variance (UV) scaling,” data can be scaled to equal variance. Usually, the value of each descriptor is scaled by 1/StDev, where StDev is the standard deviation for that descriptor for all samples. For example, if the standard deviation at δ 10.0-9.96, for all spectra, is 2.5 units, then the intensity at δ 10.0-9.96, for all spectra, is scaled by 1/2.5 or 0.4. Unit variance scaling may be used to reduce the impact of “noisy” data. For example, some metabolites in biofluids show a strong degree of physiological variation (e.g., diurnal variation, dietary-related variation) that is unrelated to any pathophysiological process. Without unit variance scaling, these noisy metabolites may dominate subsequent analysis.
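Unit variance scaling may be sketched as follows. The text does not state whether the sample or the population standard deviation is intended; this illustration assumes the sample standard deviation (as computed by `statistics.stdev`), and the function name and values are hypothetical.

```python
import statistics


def unit_variance_scale(matrix):
    """Divide every entry by the standard deviation of its column."""
    n_cols = len(matrix[0])
    # Assumption: sample standard deviation (n - 1 in the denominator).
    stdevs = [statistics.stdev(row[j] for row in matrix) for j in range(n_cols)]
    if any(s == 0 for s in stdevs):
        raise ValueError("a descriptor with zero variance cannot be scaled")
    return [[row[j] / stdevs[j] for j in range(n_cols)] for row in matrix]


# Hypothetical data: the second descriptor varies twice as much as the first.
data = [[1.0, 2.0],
        [3.0, 6.0],
        [5.0, 10.0]]
scaled = unit_variance_scale(data)
```

After scaling, both descriptors have a standard deviation of 1, so neither dominates merely because of its larger spread.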

[0214] “Logarithmic scaling” may be used to assist interpretation when data have a positive skew and/or when data spans a large range, e.g., several orders of magnitude. Usually, for each descriptor, the value is replaced by the logarithm of that value. For example, the intensity at δ 10.0-9.96 is replaced by the logarithm of the intensity at δ 10.0-9.96, for all spectra.

[0215] In “equal range scaling,” each descriptor is divided by the range of that descriptor for all samples. In this way, all descriptors have the same range, that is, 1. For example, if, at δ 10.0-9.96, for all spectra, the largest value is 87 units and the smallest value is 1, then the range is 86 units, and the intensity at δ 10.0-9.96, for all spectra, is divided by 86 units. However, this method is sensitive to the presence of outlier points.
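Equal range scaling may be sketched as follows, using the values from the example above (largest value 87, smallest value 1) for a single descriptor. The function name and the intermediate value 44 are hypothetical.

```python
def equal_range_scale(matrix):
    """Divide every entry by the range (max - min) of its column."""
    n_cols = len(matrix[0])
    ranges = []
    for j in range(n_cols):
        column = [row[j] for row in matrix]
        span = max(column) - min(column)
        if span == 0:
            raise ValueError(f"descriptor {j} is constant and has zero range")
        ranges.append(span)
    return [[row[j] / ranges[j] for j in range(n_cols)] for row in matrix]


# One descriptor observed in three hypothetical spectra; the range is 86.
data = [[1.0], [87.0], [44.0]]
scaled = equal_range_scale(data)
```

The scaled descriptor spans a range of exactly 1, although it is not shifted to start at zero. A single extreme value would enlarge the range and compress all other samples, which illustrates the sensitivity to outliers noted above.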

[0216] In “autoscaling,” each data vector is mean centred and unit variance scaled. This technique is very useful because each descriptor is then weighted equally and, in the case of NMR descriptors, large and small peaks are treated with equal emphasis. This can be important for metabolites present at very low levels but still NMR-detectable.

[0217] Several supervised methods of scaling data are also known. Some of these can provide a measure of the ability of a parameter (e.g., a descriptor) to discriminate between classes, and can be used to improve classification by stretching a separation.

[0218] For example, in “variance weighting,” the variance weight of a single parameter (e.g., a descriptor) is calculated as the ratio of the inter-class variances to the sum of the intra-class variances. A large value means that this variable is discriminating between the classes. For example, if the samples are known to fall into two classes (e.g., a training set), it is possible to examine the mean and variance of each descriptor. If a descriptor has very different mean values and a small variance, then it will be good at separating the classes.

[0219] “Feature weighting” is a more general description of variance weighting, where not only the mean and standard deviation of each descriptor are calculated, but other well known weighting factors, such as the Fisher weight, are used.
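One possible reading of the variance weight may be sketched as follows. The precise definition is not given in the text, so this illustration makes explicit assumptions: the inter-class variance is taken as the population variance of the class means, and the intra-class term as the sum of the population variances within each class. The function name and the values are hypothetical, and other definitions are in use.

```python
import statistics


def variance_weight(values_by_class):
    """Ratio of the variance of the class means to the summed
    within-class variances, for a single descriptor.

    `values_by_class` is a list of lists, one inner list per class.
    Assumption: population variances are used throughout.
    """
    class_means = [statistics.fmean(values) for values in values_by_class]
    between = statistics.pvariance(class_means)
    within = sum(statistics.pvariance(values) for values in values_by_class)
    if within == 0:
        raise ValueError("zero within-class variance")
    return between / within


# Hypothetical training set with two classes of three samples each.
well_separated = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]]  # very different means
overlapping = [[5.0, 6.0, 7.0], [5.5, 6.5, 7.5]]        # nearly equal means

w_good = variance_weight(well_separated)
w_poor = variance_weight(overlapping)
```

Under these assumptions, the first descriptor receives a much larger weight than the second, consistent with the statement that a descriptor with very different class means and a small variance is good at separating the classes.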

[0220] Spurious or irregular data (“outliers”), which are not representative, are preferably identified and removed. Common reasons for irregular data (“outliers”) include poor phase correction, poor baseline correction, poor chemical shift referencing, poor water suppression, bacterial contamination, shifts in the pH of the biofluid, toxin- or disease-induced biochemical response, and idiosyncratic response to xenobiotics.

[0221] Outliers are identified in different ways depending on the method of analysis used. For example, when using principal component analysis (PCA), small numbers of samples lying far from the rest of the replicate group can be identified by eye as outliers. A more objective means of identification for PCA is to use Hotelling's T² test, which is the multivariate version of the well known Student's t test used in univariate statistics. For any given sample, the T² value can be calculated and this is compared with a standard value within which a chosen fraction (e.g., 95%) of the samples would normally lie. Samples with T² values substantially outside this limit can then be flagged as outliers. Also, when using more sophisticated supervised methods, such as SIMCA or PNNs, a similar method is used. A confidence level (e.g., 95%) is selected and the region of multivariate space corresponding to confidence values above this limit is determined. This region can be displayed graphically in several different ways (for example by plotting the critical T² ellipse on a PCA scores plot). Any samples falling outside the high confidence region are flagged as potential outliers. Naturally, such samples are investigated in detail to determine the causes of their outlying nature before removing them from the model.

[0222] Applications

[0223] As discussed above, the methods of the present invention may be used in the analysis of chemical, biochemical, and biological data.

[0224] Metabonomic methods, in conjunction with the methods of the present invention, provide powerful means for the diagnosis, prognosis, and treatment of disease, for understanding the benefits and side-effects of xenobiotic compounds thereby aiding the drug development process, as well as for improving therapeutic regimes for current drugs.

[0225] For example, applications of metabonomic methods, in conjunction with the methods of the present invention, include, but are not limited to, early detection of abnormality/problem; differential diagnosis (classification of disease); prognosis (prediction of future outcome); therapeutic monitoring; identifying, classifying, determining the progress of, and monitoring the treatment of, infectious diseases; clinical evaluations of drug therapy and efficacy; detection of toxic side-effects of drugs and model compounds (e.g., in the drug development process and in clinical trials); investigation of idiosyncratic toxicity; characterization and identification of drugs used in overdose; classification, fingerprinting, and diagnosis of metabolic diseases (e.g., inborn errors of metabolism); improvement in the quality control of transgenic animal models of disease; aiding the design of transgenic models of disease; and searching for new biochemical markers of disease and/or tissue or organ damage.

[0226] Metabonomic methods, in conjunction with the methods of the present invention, may be used as an alternative or adjunct to the various genomic, pharmacogenomic, and proteomic methods, including those described above.

[0227] Metabonomic methods, in conjunction with the methods of the present invention, may also be used to identify (known or novel) genotypes and/or phenotypes, and to determine an organism's phenotype or genotype. This may assist with the choice of a suitable treatment or allow assessment of its relevance in a drug development process. For example, the generation of metabonomic data in panels of individuals with disease states, infected states, or undergoing treatment may indicate response profiles of groups of individuals which can be differentiated into two or more subgroups, indicating that an allelic genetic basis for response to the disease, state, or treatment exists. For example, a particular phenotype may not be susceptible to treatment with a certain drug, while another phenotype may be susceptible to treatment. Conversely, one phenotype might show toxicity because of a failure to metabolise and hence excrete a drug, which drug might be safe in another phenotype as it does not exhibit this effect. For example, metabonomic methods may be used to determine the acetylator status of an organism: there are two phenotypes, corresponding to “fast” and “slow” acetylation of drug metabolites. Phenotyping may be achieved on the basis of the urine alone (i.e., without dosing a xenobiotic), or on the basis of urine following dosing with a xenobiotic which has the potential for acetylation (e.g., galactosamine). Similar methods may also be used to determine other differences, such as other enzymatic polymorphisms, for example, cytochrome P450 polymorphism.

[0228] Metabonomic methods, in conjunction with the methods of the present invention, may also be used in studies of the biochemical consequences of genetic modification, for example, in “knock-out” animals where one or more genes have been removed or made non-functional; in “knock-in” animals where one or more genes have been incorporated from the same or a different species; and in animals where the number of copies of a gene has been increased (as in the model which results in the over-expression of the beta amyloid protein in mouse brains as a model for Alzheimer's disease). Genes can be transferred between bacterial, plant and animal species.

[0229] The combination of genomic, proteomic, and metabonomic data sets into comprehensive “bionomic” systems may permit an holistic evaluation of perturbed in vivo function.

[0230] The methods of the present invention are also useful in other applications, including investigations into the effects of environmental pollutants (e.g., wastewater analysis, animal population studies, studies of invertebrates, marine organisms), and the effects of xenobiotic stimuli and genetic changes in plants.

EXAMPLES

[0231] The following examples are provided solely to illustrate the present invention and are not intended to limit the scope of the invention, as described herein.

[0232] The methods of the present invention have been exemplified in their application to NMR spectra. Nonetheless, the methods of the present invention are similarly applicable to other types of spectra, such as those discussed above.

[0233] A spectral data set consisting of 75 spectra was simulated, representing spectra taken at five time points (T₁, T₂, T₃, T₄, and T₅) for three groups of five animals (A₁-A₅, A₆-A₁₀, and A₁₁-A₁₅). The first group of animals (A₁-A₅) were control animals. The second group of animals (A₆-A₁₀) were dosed animals. The third group of animals (A₁₁-A₁₅) were also dosed animals, but differently so (for example, with a different drug/toxin, or a different amount of the same drug/toxin).

[0234] The data set was generated using a PARAFAC model (see, for example, Bro, 1997). In this model, the generated spectra were linear combinations of the four base spectra (denoted A, B, C, and D) shown in FIG. 1, where chemical shift (represented by spectral bin number) is along the x-axis, and spectral intensity is along the y-axis. The contribution of each base spectrum is determined by two corresponding factors, the animal factor and the time factor, discussed below.

[0235] The animal factors (denoted AF_(A), AF_(B), AF_(C), and AF_(D)) are shown in FIG. 2, where the animal number (A₁-A₁₅) is along the x-axis, and the animal factor is along the y-axis. Thus, for each base spectrum and animal, there is an animal factor, e.g., AF_(B-A7) for base spectrum B and animal 7.

[0236] The time factors (denoted TF_(A), TF_(B), TF_(C), and TF_(D)) are shown in FIG. 3, where the time point (T₁-T₅) is along the x-axis and the time factor is along the y-axis. Thus, for each base spectrum and time point, there is a time factor, e.g., TF_(B-T3) for base spectrum B and time point 3.

[0237] For example, the spectrum for animal number 7 (A₇) at time point 3 (T₃) is a linear combination of the four base spectra (A, B, C, D), with coefficients (AF_(A-A7)*TF_(A-T3)), (AF_(B-A7)*TF_(B-T3)), (AF_(C-A7)*TF_(C-T3)) and (AF_(D-A7)*TF_(D-T3)), respectively.
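The construction of a simulated spectrum as a linear combination of base spectra may be sketched as follows. The function name, the dictionary layout, and all numerical values are hypothetical and chosen only to keep the example small (two base spectra of three bins each); they are not the base spectra or factors of FIGS. 1 to 3.

```python
def simulate_spectrum(base_spectra, animal_factors, time_factors, animal, time):
    """Return the sum over base spectra of AF * TF * base spectrum."""
    n_bins = len(next(iter(base_spectra.values())))
    spectrum = [0.0] * n_bins
    for name, base in base_spectra.items():
        # Coefficient for this base spectrum, e.g. AF_(B-A7) * TF_(B-T3).
        coefficient = animal_factors[name][animal] * time_factors[name][time]
        for j, intensity in enumerate(base):
            spectrum[j] += coefficient * intensity
    return spectrum


# Hypothetical inputs: two base spectra, factors for animal 7 and time point 3.
base_spectra = {"A": [1.0, 0.0, 0.0], "B": [0.0, 1.0, 0.0]}
animal_factors = {"A": {7: 0.5}, "B": {7: 1.0}}
time_factors = {"A": {3: 0.4}, "B": {3: 0.6}}

spectrum_a7_t3 = simulate_spectrum(base_spectra, animal_factors, time_factors, 7, 3)
```

With these hypothetical factors, the coefficients are 0.5 × 0.4 for base spectrum A and 1.0 × 0.6 for base spectrum B.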

[0238] For example, spectra for animal number 6 (A₆) at the five time points (T₁-T₅) are shown in FIG. 4, as curves (i), (ii), (iii), (iv), and (v), respectively. Spectrum (i) is for A₆-T₁; spectrum (ii) is for A₆-T₂; spectrum (iii) is for A₆-T₃; spectrum (iv) is for A₆-T₄; and spectrum (v) is for A₆-T₅. Those peaks marked (X) will be the subject of spectral replacement (see below). The peaks marked (Y) are the endogenous metabolites associated with the animals' response to the applied stimulus.

[0239] For the control animals (A₁-A₅), the animal factors AF_(B), AF_(C), and AF_(D) are all very small (less than about 0.05) while the animal factor AF_(A) is large and approximately constant (about 0.5). Therefore, the spectra for the control animals are dominated by base spectrum A. Also, the time factor for base spectrum A, TF_(A), is approximately constant for all time points (about 0.45). Therefore, qualitatively (and as expected in a real control group), all spectra for the control animals are very similar. The master control spectrum, in this case, the mean of all 25 control spectra (5 animals, 5 time points), is shown in FIG. 4.

[0240] For the second group of animals (A₆-A₁₀), the animal factor AF_(D) is very small (less than about 0.05) while the animal factor AF_(A) is about 0.5, and the animal factors AF_(B) and AF_(C) are about 1.0. Therefore, the spectra for the second group of animals are dominated by the base spectra A, B and C. Also, the time factor for base spectrum A, TF_(A), is approximately constant for all time points (about 0.45), while the time factor for the base spectrum B, TF_(B), varies from about 0.1 to about 0.65, and peaks at time point 3, and the time factor for the base spectrum C, TF_(C), varies from about 0.1 to about 0.55, and peaks at time point 4. Therefore, qualitatively, the spectra for the second group of animals will resemble the base spectrum A, but with varying amounts of base spectra B and C superposed thereupon.

[0241] For the third group of animals (A₁₁-A₁₅), the animal factors AF_(B) and AF_(C) are very small (less than about 0.05) while the animal factor AF_(A) is about 0.5, and the animal factor AF_(D) is about 1.0. Therefore, the spectra for the third group of animals are dominated by the base spectra A and D. Also, the time factor for base spectrum A, TF_(A), is approximately constant for all time points (about 0.45), while the time factor for the base spectrum D, TF_(D), varies from about 0.1 to about 0.75, and peaks at time point 4. Therefore, qualitatively, the spectra for the third group of animals will resemble the base spectrum A, but with varying amounts of base spectrum D superposed thereupon.

[0242] As discussed above, base spectrum A qualitatively represents the spectrum for control animals (although it is also present in the spectra for dosed animals). For the purposes of this example, base spectrum B qualitatively represents a metabolite or metabolites of the administered drug/toxin (i.e., an interfering signal), while base spectra C and D qualitatively represent different biomarkers or biomarker combinations of the animals' response to the two different drug/toxin regimes.

[0243] Using a conventional analysis, the spectral regions associated with the interfering signal (i.e., in base spectrum B) were identified as target regions (in this example, spectral bin numbers 15-26 inclusive and 47-58 inclusive), and the data in these spectral regions deleted from all spectra. The resulting “deleted” spectra were re-normalised and then analysed by principal component analysis.

[0244] The resulting scores plot (PC2 versus PC1) is shown in FIG. 7. Two groups of data points were clearly separated (from the control population), specifically A₆-A₁₀, T₂ and A₆-A₁₀, T₃. Two groups of data points were partially separated, specifically A₆-A₁₀, T₁ and A₆-A₁₀, T₄. Several groups of data points were not separated, specifically A₆-A₁₀, T₅ and A₁₁-A₁₅, T₁₋₅.

[0245] Using the methods of the present invention, the target regions, that is, the spectral regions associated with the interfering signal (i.e., in base spectrum B) were identified (in this example, spectral bin numbers 15-26 inclusive and 47-58 inclusive). A master control spectrum was calculated as the mean of the 25 control animal spectra (the master control spectrum is shown in FIG. 4). The target regions in all spectra for animals 6-10 were then replaced with corresponding scaled replacement regions from the master control spectrum. The resulting target-replaced spectra were then re-normalised to give normalised target-replaced spectra.

[0246] Two examples of this process are shown in FIG. 5 (for the spectrum for animal number 6 at time point 2, A₆, T₂) and FIG. 6 (for the spectrum for animal number 6 at time point 3, A₆, T₃). In each case: spectrum (i) is the original spectrum, before replacement; spectrum (ii) is the spectrum after spectral replacement; spectrum (iii) is spectrum (ii) after re-normalisation; the first target region (T-I) was spectral bin numbers 15-26 inclusive, and the second target region (T-II) was spectral bin numbers 47-58 inclusive, as indicated by the vertical dotted lines. The numerical parameters are summarised in the table below.

TABLE 1
Parameters for Spectral Replacement

                 i = A₆T₂    i = A₆T₃
I_(Yi)           3.75        2.56
Σ I_(Yi,Tk)      1.77        1.55
I_(CM)           0.55        0.55
Σ I_(CM,Rk)      0.24        0.24
f_(i)            6.33        3.23
N_(f)            1.82        1.82

[0247] The resulting normalised target-replaced spectra were then analysed by principal component analysis. In one analysis, the replaced regions were treated as “missing data” (a conventional method in PCA analysis) and the resulting scores plot (PC2 versus PC1) is shown in FIG. 8. Nine groups of data points were clearly separated (from the control population), specifically A₆-A₁₀, T₁₋₅ and A₁₁-A₁₅, T₁₋₄. One group of data points was not separated, specifically A₁₁-A₁₅, T₅.

[0248] In another analysis, the replaced regions were not treated as “missing data” and the resulting scores plot (PC2 versus PC1) is shown in FIG. 9. Eight groups of data points were clearly separated (from the control population), specifically A₆-A₁₀, T₁₋₄ and A₁₁-A₁₅, T₁₋₄. One group of data points was partially separated, specifically A₆-A₁₀, T₅. One group of data points was not separated, specifically A₁₁-A₁₅, T₅.

[0249] The clear separation of the A₁₁-A₁₅ data in FIGS. 8 and 9 (specifically, A₁₁-A₁₅, T₁₋₄), as compared to the lack of their separation in FIG. 7, demonstrates the effectiveness of the methods of the present invention, specifically, in retrieving information that would otherwise have been lost or missed.

REFERENCES

[0250] A number of patents and publications are cited above in order to more fully describe and disclose the invention and the state of the art to which the invention pertains. Full citations for these references are provided below. Each of these references is incorporated herein by reference in its entirety into the present disclosure, to the same extent as if each individual reference was specifically and individually indicated to be incorporated by reference.

[0251] Anker, L. S., and Jurs, P. C., 1992, “Prediction of C-13 nuclear magnetic resonance chemical shifts by artificial neural networks,” Anal. Chem., Vol. 64, p. 1157.

[0252] Anthony, M. L. et al., 1994, “Pattern recognition classification of the site of nephrotoxicity based on metabolic data derived from proton nuclear magnetic resonance spectra of urine,” Mol. Pharmacol., Vol. 46, pp. 199-211.

[0253] Anthony, M. L. et al., 1995, “Classification of toxin-induced changes in ¹H NMR spectra of urine using an artificial neural network,” J. Pharm. Biomed. Anal., Vol. 13, pp. 205-211.

[0254] Beckwith-Hall, B. M. et al., 1998, “Nuclear magnetic spectroscopic and principal components analysis investigations into biochemical effects of three model hepatotoxins,” Chem. Res. Tox., Vol. 11, pp. 260-272.

[0255] Bishop, C., 1995, Neural Networks for Pattern Recognition, University Press, Oxford, England, pp. 164-193.

[0256] Bretthorst, 1990, “An Introduction to Parameter Estimation Using Bayesian Probability Theory,” in: Maximum Entropy and Bayesian Methods, (Fougere, P. F., editor) (Kluwer Academic Publishers, The Netherlands), pp. 53-79.

[0257] Bro, R., 1997, “PARAFAC. Tutorial and applications,” in Chemometrics and Intelligent Laboratory Systems, Vol. 38, pp. 149-171.

[0258] Broomhead, D. S., and Lowe, D., 1988, “Multi-variable functionalinterpolation and adaptive networks,” Complex Systems, Vol. 2, p. 321.

[0259] Brown, T. R. and Stoyanova, R., 1996, “NMR spectral quantitationby principal component analysis. 2. Determination of frequency and phaseshifts,” J. Magn. Reson., Vol. 112B, p. 32.

[0260] Fan, T. W.-M., 1996, “Metabolite profiling by one- and two-dimensional NMR analysis of complex mixtures,” Prog. NMR Spectrosc., Vol. 28, pp. 161-219.

[0261] Farrant, R. D., et al., 1992, “An automatic data reduction and transfer method to aid pattern recognition analysis and classification of NMR spectra,” J. Pharm. Biomed. Anal., Vol. 10, p. 141.

[0262] Frank, I. E., et al., 1984, “Prediction of product quality from spectral data using the partial least squares method,” J. Chem. Info. Comp., Vol. 24, p. 20.

[0263] Gartland, K. P. R., et al., 1990a, “A pattern recognition approach to the comparison of ¹H NMR and clinical chemical data for classification of nephrotoxicity,” J. Pharm. Biomed. Anal., Vol. 8, pp. 963-968.

[0264] Gartland, K. P. R., et al., 1990b, “Pattern recognition analysis of high resolution ¹H NMR spectra of urine. A nonlinear mapping approach to the classification of toxicological data,” NMR in Biomed., Vol. 3, pp. 166-172.

[0265] Gartland, K. P. R., et al., 1991, “The application of pattern recognition methods to the analysis and classification of toxicological data derived from proton NMR spectroscopy of urine,” Mol. Pharmacol., Vol. 39, pp. 629-642.

[0266] Geisow, M. J., 1998, “Proteomics: One small step for a digital computer, one giant leap for humankind,” Nature Biotechnology, Vol. 16, p. 206.

[0267] Gygi, S. P., et al., 1999, “Correlation between protein and mRNA abundance in yeast,” Molecular and Cellular Biology, Vol. 19, pp. 1720-1730.

[0268] Hare, B. J., and Prestegard, J. H., 1994, “Application of networks to automated assignment of NMR spectra of proteins,” J. Biomol. NMR, Vol. 4, p. 35.

[0269] Holmes, E., et al., 1998a, “Development of a model for classification of toxin-induced lesions using ¹H NMR spectroscopy of urine combined with pattern recognition,” NMR in Biomed., Vol. 11, pp. 235-244.

[0270] Holmes, E., et al., 1998b, “The identification of novel biomarkers of renal toxicity using automatic data reduction techniques and PCA of proton NMR spectra of urine,” Chemomet. & Intel. Lab Systems, Vol. 44, pp. 245-255.

[0271] Holmes, E., et al., 1992, “NMR spectroscopy and pattern recognition analysis of the biochemical processes associated with the progression and recovery from nephrotoxic lesions in the rat induced by mercury(II) chloride and 2-bromo-ethanamine,” Mol. Pharmacol., Vol. 42, pp. 922-930.

[0272] Holmes, E., et al., 1994, “Automatic data reduction and pattern recognition methods for analysis of ¹H NMR spectra of human urine from normal and pathological states,” Anal. Biochem., Vol. 220, pp. 284-296.

[0273] Joreskog, K. G., and Wold, H., 1982, Systems under Indirect Observation, North Holland, Amsterdam.

[0274] Klenk, H. P., et al., 1997, “The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus,” Nature, Vol. 390, pp. 364-370.

[0275] Lindon, J. C., et al., 1999, “NMR spectroscopy of biofluids,” in Annual Reports on NMR Spectroscopy (Webb, G. A., ed.), Academic Press (London), Vol. 38, pp. 1-88.

[0276] Lindon, J. C., and Ferrige, A. G., 1980, “Digitisation and Data Processing in Fourier Transform NMR,” Progress in NMR Spectroscopy, Vol. 14, pp. 27-66.

[0277] Moka, D., et al., 1998, “Biochemical classification of kidney carcinoma biopsy samples using magic angle spinning NMR spectroscopy,” J. Pharm. Biomed. Anal., Vol. 17, pp. 125-132.

[0278] Nicholson, J. K., et al., 1989, “High resolution proton magnetic resonance spectroscopy of biological fluids,” Prog. NMR Spectrosc., Vol. 21, pp. 449-501.

[0279] Nicholson, J. K., et al., 1995, “750 MHz ¹H and ¹H-¹³C NMR spectroscopy of human blood plasma,” Anal. Chem., Vol. 67, pp. 793-811.

[0280] Nicholson, J. K., et al., 1999, “Metabonomics: understanding the metabolic response of living systems to pathophysiological stimuli via multivariate statistical analysis of biological NMR spectroscopic data,” Xenobiotica, Vol. 29, pp. 1181-1189.

[0281] Nilsson, N. J., 1965, Learning Machines, McGraw-Hill, New York.

[0282] Parzen, E., 1962, “On estimation of a probability density function and mode,” Ann. Mathemat. Stat., Vol. 33, p. 1065.

[0283] Patterson, D., 1996, Artificial Neural Networks, Prentice Hall, Singapore.

[0284] Press, William H., Teukolsky, Saul A., Vetterling, William T., Flannery, Brian P., January 1993, Numerical Recipes in C: The Art of Scientific Computing, 2nd edition, Cambridge University Press.

[0285] Quinlan, J. R., 1986, “Induction of decision trees,” Machine Learning, Vol. 1, p. 81.

[0286] Sharaf, M. A., et al., 1986, Chemometrics, J. Wiley & Sons, New York.

[0287] Specht, D. F., 1990, “Probabilistic neural networks,” Neur. Networks, Vol. 3, p. 109.

[0288] Spraul, M., et al., 1994, “Automatic reduction of NMR spectroscopic data for statistical and pattern recognition classification of samples,” J. Pharm. Biomed. Anal., Vol. 12, pp. 1215-1225.

[0289] Sze, D. Y., et al., 1994, “High-resolution proton NMR studies of lymphocyte extracts,” Immunomethods, Vol. 4, pp. 113-126.

[0290] Tomlins, A. M., et al., 1998, “High resolution magic angle spinning ¹H NMR analysis of intact prostatic hyperplastic and tumour tissues,” Anal. Comm., Vol. 35, pp. 113-115.

[0291] Tranter, G. E., et al., 1999, “Metabonomic prediction of drug toxicity via probabilistic neural network analysis of NMR biofluid data,” Abstr. 9th North American ISSX Meeting, Oct. 24-28, 1999, p. 246.

[0292] Wasserman, P. D., 1989, Neural Computing: Theory and Practice, Van Nostrand Reinhold, New York, USA.

[0293] Wold, H., 1966, in Multivariate Analysis (P. R. Krishnaiah, Ed.), Academic Press, New York.

[0294] Wold, S., 1976, “Pattern recognition by means of disjoint principal components models,” Pattern Recog., Vol. 8, p. 127.

CLAIMS

1. A method for processing a sample spectrum comprising: replacing each of one or more target regions in said sample spectrum with a corresponding replacement region of a master control spectrum to give a target-replaced sample spectrum, wherein said replacement region has been scaled so as to have the same fraction of the total integrated intensity in said target-replaced sample spectrum as it did in said master control spectrum.

2. A method for processing a sample spectrum for a test sample, said method comprising the steps of:
(a) identifying, in said sample spectrum, one or more target regions for replacement;
(b) providing a master control spectrum which comprises one replacement region corresponding to each of said target regions; and,
(c) replacing each of said target regions with the corresponding replacement region to give a target-replaced sample spectrum, wherein said replacement region has been scaled so as to have the same fraction of the total integrated intensity in said target-replaced sample spectrum as it did in said master control spectrum.

3. A method according to claim 2, further comprising the subsequent step of: (d) normalising said target-replaced sample spectrum to give a normalised target-replaced sample spectrum.

4. A method for processing a sample NMR spectrum for a test sample, said method comprising the steps of:
(a) identifying, in said sample NMR spectrum, one or more target regions for replacement, wherein each of said target regions is defined by a chemical shift range;
(b) providing a master control NMR spectrum which comprises one replacement region corresponding to each of said target regions, wherein a target region and its corresponding replacement region are defined by the same chemical shift range; and,
(c) replacing each of said target regions with the corresponding replacement region to give a target-replaced sample NMR spectrum, wherein said replacement region has been scaled so as to have the same fraction of the total integrated intensity in said target-replaced sample NMR spectrum as it did in said master control NMR spectrum.

5. A method according to claim 4, further comprising the subsequent step of: (d) normalising said target-replaced sample NMR spectrum to give a normalised target-replaced sample NMR spectrum.

6. A method according to claim 2, wherein, in said replacing step (c), each of said target regions is replaced with the corresponding replacement region to give a target-replaced sample spectrum, wherein said replacement region has been scaled by a factor, f, given by the formula:

$f = \frac{I_{Y} - \sum_{k} I_{Y,Tk}}{I_{CM} - \sum_{k} I_{CM,Rk}}$

wherein: $I_{Y}$ is the total integrated intensity of the sample spectrum; $I_{Y,Tk}$ is the integrated intensity of the target region; $I_{CM}$ is the total integrated intensity of the master control spectrum; $I_{CM,Rk}$ is the integrated intensity of the replacement region; k ranges from 1 to $n_{t}$; and, $n_{t}$ is the number of target regions.

7. A method according to claim 3, wherein, in said replacing step (c), each of said target regions is replaced with the corresponding replacement region to give a target-replaced sample spectrum, wherein said replacement region has been scaled by a factor, f, given by the formula:

$f = \frac{I_{Y} - \sum_{k} I_{Y,Tk}}{I_{CM} - \sum_{k} I_{CM,Rk}}$

wherein: $I_{Y}$ is the total integrated intensity of the sample spectrum; $I_{Y,Tk}$ is the integrated intensity of the target region; $I_{CM}$ is the total integrated intensity of the master control spectrum; $I_{CM,Rk}$ is the integrated intensity of the replacement region; k ranges from 1 to $n_{t}$; and, $n_{t}$ is the number of target regions.

8. A method according to claim 4, wherein, in said replacing step (c), each of said target regions is replaced with the corresponding replacement region to give a target-replaced sample spectrum, wherein said replacement region has been scaled by a factor, f, given by the formula:

$f = \frac{I_{Y} - \sum_{k} I_{Y,Tk}}{I_{CM} - \sum_{k} I_{CM,Rk}}$

wherein: $I_{Y}$ is the total integrated intensity of the sample spectrum; $I_{Y,Tk}$ is the integrated intensity of the target region; $I_{CM}$ is the total integrated intensity of the master control spectrum; $I_{CM,Rk}$ is the integrated intensity of the replacement region; k ranges from 1 to $n_{t}$; and, $n_{t}$ is the number of target regions.

9. A method according to claim 5, wherein, in said replacing step (c), each of said target regions is replaced with the corresponding replacement region to give a target-replaced sample spectrum, wherein said replacement region has been scaled by a factor, f, given by the formula:

$f = \frac{I_{Y} - \sum_{k} I_{Y,Tk}}{I_{CM} - \sum_{k} I_{CM,Rk}}$

wherein: $I_{Y}$ is the total integrated intensity of the sample spectrum; $I_{Y,Tk}$ is the integrated intensity of the target region; $I_{CM}$ is the total integrated intensity of the master control spectrum; $I_{CM,Rk}$ is the integrated intensity of the replacement region; k ranges from 1 to $n_{t}$; and, $n_{t}$ is the number of target regions.
10. A sample spectrum which has been processed by a method according to claim 1.

11. A sample spectrum which has been processed by a method according to claim 2.

12. A sample spectrum which has been processed by a method according to claim 4.

13. A method for processing a plurality of sample spectra, comprising processing each of said sample spectra by a method according to claim 1.

14. A method for processing a plurality of sample spectra, comprising processing each of said sample spectra by a method according to claim 2.

15. A method for processing a plurality of sample spectra, comprising processing each of said sample spectra by a method according to claim 4.

16. A method of analysis of an applied stimulus, comprising the steps of:
(a) providing one or more sample spectra for each of one or more samples from each of one or more organisms which have been subjected to said applied stimulus;
(b) providing a master control spectrum derived from one or more control spectra for each of one or more samples from each of one or more organisms which have not been subjected to said applied stimulus;
(c) processing each of said sample spectra using a method according to claim 1.

17. A method of analysis of an applied stimulus, comprising the steps of:
(a) providing one or more sample spectra for each of one or more samples from each of one or more organisms which have been subjected to said applied stimulus;
(b) providing a master control spectrum derived from one or more control spectra for each of one or more samples from each of one or more organisms which have not been subjected to said applied stimulus;
(c) processing each of said sample spectra using a method according to claim 2.

18. A method of analysis of an applied stimulus, comprising the steps of:
(a) providing one or more sample spectra for each of one or more samples from each of one or more organisms which have been subjected to said applied stimulus;
(b) providing a master control spectrum derived from one or more control spectra for each of one or more samples from each of one or more organisms which have not been subjected to said applied stimulus;
(c) processing each of said sample spectra using a method according to claim 4.

19. A method according to claim 16, wherein said applied stimulus is a xenobiotic.

20. A method according to claim 16, wherein said applied stimulus is a disease state.

21. A method according to claim 16, wherein said applied stimulus is a genetic modification.

22. A method for identifying a biomarker or biomarker combination for an applied stimulus, comprising a method of analysis according to claim 16.

23. A biomarker or biomarker combination identified by a method according to claim 22.

24. A method of diagnosis of an applied stimulus employing a biomarker identified by a method according to claim 22.

25. An assay which employs a biomarker identified by a method according to claim 22.

26. A method of classifying an applied stimulus, comprising a method of analysis according to claim 16.

27. A method of diagnosis of an applied stimulus, comprising a method of analysis according to claim 16.

28. A method of therapeutic monitoring of a subject undergoing therapy, comprising a method of analysis according to claim 16.

29. A method of evaluating drug therapy and/or drug efficacy, comprising a method of analysis according to claim 16.

30. A method of detecting toxic side-effects of drug, comprising a method of analysis according to claim 16.

31. A method of characterising and/or identifying a drug in overdose, comprising a method of analysis according to claim 16.

32. A method according to claim 16, wherein said spectrum or spectra is an NMR spectrum or NMR spectra.

33. A computer system operatively configured to implement a method according to claim 1.

34. Computer code suitable for implementing a method according to claim 1 on a suitable computer system.

35. A data carrier which carries computer code suitable for implementing a method according to claim 1 on a suitable computer system.
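
By way of illustration only, and not as part of the claims, the scaling by the factor f recited in claims 6 to 9 may be sketched in Python as follows. The sketch assumes that the sample spectrum and the master control spectrum are sampled on the same uniformly spaced grid, so that an integrated intensity can be approximated by a plain sum of data points, and that the target regions are given as non-overlapping index ranges. The function names (scaling_factor, replace_regions, normalise), the choice of normalising to a constant total intensity, and all numerical values are hypothetical and were chosen for this example.

```python
from typing import List, Sequence, Tuple

# A region is a half-open index range [start, stop) into the spectrum.
# Assumption: the same index range defines a target region in the sample
# and its corresponding replacement region in the master control spectrum.
Region = Tuple[int, int]


def scaling_factor(sample: Sequence[float],
                   master: Sequence[float],
                   regions: Sequence[Region]) -> float:
    """Return f = (I_Y - sum_k I_Y,Tk) / (I_CM - sum_k I_CM,Rk)."""
    i_y = sum(sample)                                    # I_Y
    i_cm = sum(master)                                   # I_CM
    i_y_t = sum(sum(sample[a:b]) for a, b in regions)    # sum_k I_Y,Tk
    i_cm_r = sum(sum(master[a:b]) for a, b in regions)   # sum_k I_CM,Rk
    denominator = i_cm - i_cm_r
    if denominator == 0:
        raise ValueError("master control has no intensity outside the regions")
    return (i_y - i_y_t) / denominator


def replace_regions(sample: Sequence[float],
                    master: Sequence[float],
                    regions: Sequence[Region]) -> List[float]:
    """Replace each target region with the scaled replacement region."""
    if len(sample) != len(master):
        raise ValueError("spectra must share the same grid")
    f = scaling_factor(sample, master, regions)
    out = list(sample)
    for a, b in regions:
        out[a:b] = [f * value for value in master[a:b]]
    return out


def normalise(spectrum: Sequence[float], total: float = 1.0) -> List[float]:
    """Scale a spectrum to a constant total intensity (one possible choice)."""
    s = sum(spectrum)
    return [total * value / s for value in spectrum]


# Made-up data: the sample has an anomalous region at indices 2 and 3.
sample = [1.0, 2.0, 30.0, 40.0, 3.0, 2.0]
master = [1.0, 1.0, 2.0, 2.0, 1.0, 1.0]
regions = [(2, 4)]

f = scaling_factor(sample, master, regions)       # f == 2.0 for these values
replaced = replace_regions(sample, master, regions)
print(f, replaced, normalise(replaced))
```

With these made-up values the replacement region carries half of the total intensity both in the master control spectrum and in the target-replaced sample spectrum, which is the condition stated in claim 1.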